Class 20: Linear Regression

Objectives

In these notes we extend the inference ideas from the previous two classes to the relationship between two continuous variables.

Two continuous variables

We have seen how to use the lm_basic function to fit models for the mean of some response. We have used both a single mean for the entire dataset as well as multiple means based on a second categorical variable. What happens if we use the same set-up but instead use a numeric variable to predict the value of some response? The output is surprisingly similar, but the interpretation of the results differs slightly.

As an example, let’s predict the amount each mammal is awake as a function of its brain weight:
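The code for this fit is not reproduced in these notes, so here is a minimal sketch of what such a fit could look like. It rests on two assumptions: that the data are the msleep dataset shipped with ggplot2 (which has columns named awake and brainwt), and that base R's lm and confint are reasonable stand-ins for lm_basic and reg_table. The numbers it prints may differ from the table discussed in these notes if the course data or functions differ.

```r
# Assumption: the data are the msleep dataset that ships with ggplot2,
# which has the columns `awake` and `brainwt` used in these notes.
library(ggplot2)

# Base R stand-in for lm_basic: regress hours awake on brain weight.
# Rows with a missing brainwt are dropped automatically by lm().
model <- lm(awake ~ 1 + brainwt, data = msleep)

# Stand-in for reg_table: point estimates next to 95% confidence bounds.
est <- cbind(estimate = coef(model), confint(model, level = 0.95))
print(est)
```

The result has one row for the intercept and one row for brainwt, each with an estimate and lower and upper confidence bounds.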

There is once again an intercept term and a row of the table corresponding to the new variable brainwt. What do these numbers mean? It turns out that the table simply describes a best-fit line through the data. We have already seen how to draw this line graphically with geom_smooth. The line here is exactly the line shown in this plot:
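The plot itself is not reproduced here. A sketch of how a plot of this kind could be drawn with geom_smooth follows, again assuming the msleep data from ggplot2. Setting method = "lm" requests the least-squares line, which is the same line a linear regression of awake on brainwt describes.

```r
library(ggplot2)

# Assumption: msleep from ggplot2 is the dataset behind these notes.
# Drop rows with a missing brain weight so ggplot2 does not warn about them.
mammals <- msleep[!is.na(msleep$brainwt), ]

p <- ggplot(mammals, aes(x = brainwt, y = awake)) +
  geom_point() +
  # method = "lm" asks for the least-squares line; se = FALSE hides the band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)
```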

The reg_table function is just giving us the intercept and slope of this line, along with confidence interval bounds for both. Does it make sense that the slope here is negative? It should!

This should explain why the first term is called the intercept: it is the value of the line when the explanatory variable equals zero, which is where the line crosses the vertical axis. As with the discrete case, there is a special meaning behind whether the confidence interval for the slope contains zero. If it does not, we say we have detected a statistically significant linear relationship between our two variables.
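Both ideas can be checked directly. The sketch below assumes the msleep data from ggplot2 and uses base R's lm, predict, and confint in place of the course functions. It first confirms that the intercept is the predicted value at a brain weight of zero, then tests whether the 95% interval for the slope contains zero.

```r
library(ggplot2)  # assumed source of the msleep data

model <- lm(awake ~ 1 + brainwt, data = msleep)

# The intercept is the height of the line at brainwt = 0.
at_zero <- predict(model, newdata = data.frame(brainwt = 0))
print(all.equal(unname(at_zero), unname(coef(model)["(Intercept)"])))

# 95% confidence bounds for the slope on brainwt.
slope_ci <- confint(model, "brainwt", level = 0.95)
print(slope_ci)

# The relationship is flagged as statistically significant exactly when
# zero lies outside the interval, i.e. both bounds share the same sign.
contains_zero <- slope_ci[1, 1] <= 0 && slope_ci[1, 2] >= 0
print(!contains_zero)
```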

Multiple linear regression

Finally, we can add multiple variables to a single regression. It is even possible to mix continuous and categorical variables in the same model:
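The model used at this point is not shown in these notes. As an illustration only, the sketch below mixes one continuous variable (brainwt) with one categorical variable (vore, the diet type) from the msleep data in ggplot2. Both the dataset and the choice of vore are assumptions, and base R's lm and confint again stand in for lm_basic and reg_table.

```r
library(ggplot2)

# Hypothetical choice of predictors: brainwt is continuous, vore
# (carnivore, herbivore, ...) is categorical.
model <- lm(awake ~ 1 + brainwt + vore, data = msleep)

# Estimates with 95% confidence bounds, one row per term in the model.
est <- cbind(estimate = coef(model), confint(model, level = 0.95))
print(est)
```

The categorical variable contributes one row per category other than the baseline, just as in the discrete case, while brainwt still contributes a single slope.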

In this case, each coefficient is interpreted as the change we would expect to see in the response given a marginal change in one of the explanatory variables on the right-hand side of the model. That is, how do we expect the mean to change if we modify one (and only one) of the explanatory variables while holding the others fixed? We could spend a lot of time focusing on this distinction, but I don’t want to go too far down this line of thinking.

Fitted values

As models become more complex, it can become difficult to directly compute the predicted values and residuals that come from them. We can use the add_prediction function in R to append the residuals to the original dataset.
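To make concrete what is being appended, here is a base R sketch of the same idea. It assumes the msleep data from ggplot2. The column names pred and resid are hypothetical and are not necessarily the names add_prediction uses.

```r
library(ggplot2)

model <- lm(awake ~ 1 + brainwt, data = msleep)

mammals <- msleep
# Predicted value for every row; NA where brainwt is missing.
mammals$pred <- predict(model, newdata = mammals)
# Residual = observed response minus predicted value.
mammals$resid <- mammals$awake - mammals$pred

print(head(mammals[, c("name", "brainwt", "awake", "pred", "resid")]))
```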

It also adds in the predicted values for each data point. These predictions are often called the fitted values. We can see what this is doing with a simple plot:
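The plot is not reproduced here. A sketch of one way to draw it follows, assuming the msleep data from ggplot2 and using predict from base R in place of add_prediction. The column name pred is hypothetical. Observed values are drawn in black and fitted values in red.

```r
library(ggplot2)

model <- lm(awake ~ 1 + brainwt, data = msleep)

# Keep only rows the model could use, then attach the fitted values.
mammals <- msleep[!is.na(msleep$brainwt), ]
mammals$pred <- predict(model, newdata = mammals)

p <- ggplot(mammals, aes(x = brainwt)) +
  geom_point(aes(y = awake)) +               # observed values
  geom_point(aes(y = pred), colour = "red")  # fitted values
print(p)
```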

As you can see, the predicted values all lie along a single line.

Practice

We will work on the next lab for the remainder of the class: lab20.Rmd