In these notes we extend the inference ideas from the previous two classes to
the relationship between two continuous variables.
Two continuous variables
We have seen how to use the lm_basic function to fit models for the
mean of some response. We have used both a single mean for the entire
dataset as well as multiple means based on a second categorical variable.
What happens if we use the same set-up but instead use a numeric variable
to predict the value of some response? The output is surprisingly similar,
but the interpretation of the results differ slightly.
As an example, let’s predict the amount each mammal is awake as a function
of its brain weight:
There is once again an intercept term and a row of the table corresponding
to the new variable brainwt. What do these numbers mean? It turns out
that this is simply describing a best-fit line through the data. We have
already seen how to do this graphically with geom_smooth. The line here
is, exactly, the line given in this plot:
The reg_table function is just giving us the intercept and slope of
this line, along with confidence interval bounds for both. Does it make
sense that the slope here is negative? It should!
This should explain why the first term is called the intercept. As with
the discrete case, there is a special meaning behind whether the confidence
interval contains zero. If it does not, we say we have detected a
statistically significant linear relationship between our two variables.
Multiple linear regression
Further, and finally, we can add multiple variables into a single regression.
It is even possible to mix continuous and categorical variables into the
The interpretation becomes, in this case, the change we would expect to
see in the response given a marginal change in one of the explanatory
variables on the right-hand side of the model. That is, how do we expect
the mean to change if we modify one (and only one) of the other variables.
We could spend a lot of time focusing on this distinction, but I don’t want
to go too far down this line of thinking.
As models become more complex it can become difficult to directly compute
the predicted values and residuals that come from it. We can use the
add_prediction function in R to append the residuals back into the original
It also adds in the predicted values for each data point. These predictions
are often called the fitted values. We can see what this is doing with
a simple plot:
As you can see, the predicted values all line along a single line.
We will work on the next lab for the remainder of the class:
Please upload your script to GitHub ahead of the next class.