Extend the inference methods from last class to measure the difference between two or more means.
Comparing Two Means
lm_basic function allows for much more complex models than
describing a simple mean. Consider a second set of data where
coins have been taken from two different cups:
In this case, we may want to model the mean of both cups. To
do this with
lm_basic, we just add the new variable to the
How do we read this new table? The intercept gives the mean value for the A cup and the second term, called a slope, gives the additional amount needed to get the mean of the coins from cup B. So, the best guess of cup B’s mean is equal to 3. What does the table look like when we add confidence intervals:
The mean of cup A is predicted to be between 0.5872 and 4.913. The difference between cup B’s mean and cup A’s mean is somewhere between -2.8086 and 3.309. Because this difference includes zero, we say that there is no statistical evidence (at the 95% level) that the mean of the two cups is different… Think about this statement for a bit. Why would a value of zero be important in this model?
Comparing Three Categories
Let’s apply this to a more complex situation using the mammals sleep dataset. We can model the average time spent awake as a function of the diet type of a given mammal:
Now each of these values gives the difference between the base level, carnivores, and all of the others. So the predicted mean for hours spent awake for insectivores is 13.6263 + (-4.5662), or about 9 hours. The confidence intervals tell us whether there is evidence that a given diet type is different from carnivores. We see, for example, that there is statistical evidence that insectivores differ from carnivores, but no evidence for distinctions between carnivores and any other groups.
The careful observer will notice that there is a problem with this
approach: what if we want to compare two values when one is not the
base level? To do so, use the
fct_relevel command with the new
baseline used as the second parameter:
Now everything is compared to the
We will not discuss all of the output of the
reg_table function in
this course, but one other piece of information will come in handy:
the Multiple R-squared. We define the residual of a model as the
difference between the actual observed response minus the fitted
mean. So, in this case we would have the following residuals:
Then, the multiple R-squared is given by:
This quantity will always be between 0 and 1, with values closer to 1 indicating that the model is describing a large proportion of the variation in the response variable. Notice that the R-squared value is always equal to 0 for a simple model with just a single mean.
We will work on the next lab for the remainder of the class: lab19.Rmd
Please upload your script to GitHub ahead of the next class.