Class 13: Inference for Several Means

Learning Objectives

  • Apply the lm_basic command to compare the means of two groups
  • Extend the two-sample t-test to multiple groups
  • Apply model-fit metrics to a regression analysis

Comparing Two Means

The lm_basic function can fit much more complex models than a single mean. Consider a second set of data where coins have been taken from two different cups:

coins2
## # A tibble: 8 x 2
##   number cup  
##    <dbl> <chr>
## 1      1 A    
## 2      1 A    
## 3      4 A    
## 4      5 A    
## 5      1 B    
## 6      3 B    
## 7      4 B    
## 8      4 B
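
If you want to follow along on your own machine, a dataset with the same values can be built directly; this is a quick sketch, assuming the tibble package is available:

library(tibble)

# eight coins, four drawn from each cup
coins2 <- tibble(
  number = c(1, 1, 4, 5, 1, 3, 4, 4),
  cup    = c("A", "A", "A", "A", "B", "B", "B", "B")
)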

In this case, we may want to model the mean of both cups. To do this with lm_basic, we just add the new variable to the formula:

model <- lm_basic(number ~ 1 + cup, data = coins2)
reg_table(model)
## 
## Call:
## lm_basic(formula = number ~ 1 + cup, data = coins2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.000 -1.750  0.500  1.062  2.250 
## 
## Coefficients:
##             Estimate
## (Intercept)     2.75
## cupB            0.25
## 
## Residual standard error: 1.768 on 6 degrees of freedom
## Multiple R-squared:  0.006623,	Adjusted R-squared:  -0.1589 
## F-statistic:  0.04 on 1 and 6 DF,  p-value: 0.8481
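
Before reading the table, it may help to compute the mean of each cup directly; here is a quick sketch, assuming the dplyr package is loaded:

library(dplyr)

# the mean number for each cup: cup A works out to 2.75 and cup B to 3
coins2 %>%
  group_by(cup) %>%
  summarize(mean_number = mean(number))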

How do we read this new table? The intercept gives the mean value for the A cup, and the second term, called a slope, gives the additional amount needed to get from the mean of cup A to the mean of cup B. So the best guess of cup B’s mean is 2.75 + 0.25, or 3. What does the table look like when we add confidence intervals?

reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = number ~ 1 + cup, data = coins2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.000 -1.750  0.500  1.062  2.250 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)   2.7500  0.5872  4.913
## cupB          0.2500 -2.8086  3.309
## 
## Residual standard error: 1.768 on 6 degrees of freedom
## Multiple R-squared:  0.006623,	Adjusted R-squared:  -0.1589 
## F-statistic:  0.04 on 1 and 6 DF,  p-value: 0.8481

The mean of cup A is predicted to be between 0.5872 and 4.913. The difference between cup B’s mean and cup A’s mean is somewhere between -2.8086 and 3.309. Because this interval includes zero, we say that there is no statistical evidence (at the 95% level) that the means of the two cups are different. Think about this statement for a bit: why would a value of zero be important in this model?

Note: I am not using the name in these notes, but just so you are aware, the statistical procedure we are applying here is called a “t-test”.
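
For the curious, base R runs this same comparison with the t.test function; setting var.equal = TRUE pools the standard deviation across the two cups, matching the assumption lm_basic makes. This is just an aside, not something we will use later:

# pooled two-sample t-test comparing the two cups; its confidence interval
# matches the cupB row above apart from its sign, because t.test reports
# the cup A mean minus the cup B mean
t.test(number ~ cup, data = coins2, var.equal = TRUE)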

Comparing Three Categories

Let’s apply this to a more complex situation using the mammals sleep dataset. We can model the average time spent awake as a function of the diet type of a given mammal:

model <- lm_basic(awake ~ 1 + vore, data = msleep)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = awake ~ 1 + vore, data = msleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0263 -4.0128  0.4237  3.4255  7.7237 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)  13.6263 11.5832 15.669
## voreherbi     0.8643 -1.7150  3.444
## voreinsecti  -4.5663 -9.0425 -0.090
## voreomni     -0.5513 -3.4044  2.302
## 
## Residual standard error: 4.467 on 72 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.08514,	Adjusted R-squared:  0.04702 
## F-statistic: 2.234 on 3 and 72 DF,  p-value: 0.09162

Now each of these values gives the difference between the base level, carnivores, and one of the other diet types. So the predicted mean number of hours spent awake for insectivores is 13.6263 + (-4.5663), or about 9 hours. The confidence intervals tell us whether there is evidence that a given diet type differs from carnivores. We see, for example, that there is statistical evidence that insectivores differ from carnivores, but no evidence of a difference between carnivores and either of the other two groups.
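
As a check on this arithmetic, we can again compute the group means directly; a sketch assuming dplyr is loaded, as above:

# mean awake time for each diet type, dropping the 7 mammals with a missing
# vore; carnivores should come out near 13.63 (the intercept) and
# insectivores near 9.06 (intercept plus the voreinsecti estimate)
msleep %>%
  filter(!is.na(vore)) %>%
  group_by(vore) %>%
  summarize(mean_awake = mean(awake))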

The careful observer will notice a limitation of this approach: what if we want to compare two categories when neither one is the base level? To do so, use the fct_relevel command, giving the new baseline as its second argument:

msleep$vore_new <- fct_relevel(msleep$vore, "insecti")
model <- lm_basic(awake ~ 1 + vore_new, data = msleep)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = awake ~ 1 + vore_new, data = msleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0263 -4.0128  0.4237  3.4255  7.7237 
## 
## Coefficients:
##               Estimate   2.5 % 97.5 %
## (Intercept)     9.0600  5.0773 13.043
## vore_newcarni   4.5663  0.0901  9.043
## vore_newherbi   5.4306  1.1480  9.713
## vore_newomni    4.0150 -0.4378  8.468
## 
## Residual standard error: 4.467 on 72 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.08514,	Adjusted R-squared:  0.04702 
## F-statistic: 2.234 on 3 and 72 DF,  p-value: 0.09162

Now everything is compared to the insecti category.
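
The same trick works for any comparison. For example, to compare herbivores directly with omnivores, relevel so that one of them is the baseline; a sketch, using a made-up column name vore_herb and assuming the same packages as above are loaded:

# make herbivores the baseline; the vore_herbomni row of the new table then
# gives the omnivore-minus-herbivore difference and its confidence interval
msleep$vore_herb <- fct_relevel(msleep$vore, "herbi")
model <- lm_basic(awake ~ 1 + vore_herb, data = msleep)
reg_table(model, level = 0.95)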

Graphing the Model

Coming back to graphics, notice that we can display similar information with a visual representation of the data. All we need is the geom_confint layer (I had to remove the rows with a missing diet type before making the plot):

msleep_nona <- filter(msleep, !is.na(vore))
ggplot(msleep_nona, aes(vore, awake)) +
  geom_confint()

[Plot: confidence intervals for mean awake time, one for each diet type]

You’ll notice that this is not exactly the same output. The interval for carnivores matches well, but the interval for insectivores is much wider in the plot than in the lm_basic output. I do not want to get too far into the weeds about why this is, other than to say it has to do with the fact that there are very few insectivores:

table(msleep$vore)
## 
##   carni   herbi insecti    omni 
##      19      32       5      20

The lm_basic model assumes that every group has the same standard deviation, whereas the graphical summary estimates the variation separately for each group. In most cases the difference is small, but when a group has very little data the two intervals can differ noticeably.
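
To see how much the spreads actually differ, we can compute the standard deviation of awake time within each diet group; again a sketch, assuming dplyr is loaded:

# per-group standard deviations and counts; lm_basic pools these into the
# single residual standard error of 4.467, while the plot lets each group,
# including the five insectivores, use its own spread
msleep %>%
  filter(!is.na(vore)) %>%
  group_by(vore) %>%
  summarize(sd_awake = sd(awake), n = n())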

Model Fit

We will not discuss all of the output of the reg_table function in this course, but one other piece of information will come in handy: the Multiple R-squared. We define the residual of a model as the difference between the observed response and the fitted mean. So, in this case, each mammal has a residual

r_i = y_i - \hat{y}_i,

where y_i is its awake time and \hat{y}_i is the fitted mean awake time for its diet group. Then, the multiple R-squared is given by:

R^2 = 1 - \frac{\sum_i r_i^2}{\sum_i (y_i - \bar{y})^2}

where \bar{y} is the mean of the response over the whole dataset.

This quantity will always be between 0 and 1, with values closer to 1 indicating that the model describes a large proportion of the variation in the response variable. Notice that the R-squared value is always equal to 0 for a simple model with just a single mean: there the fitted mean is just \bar{y}, so the residual variation is exactly the total variation.
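
As a check on this definition, the R-squared of the sleep model can be computed by hand with base R; the sketch below reuses the msleep_nona data frame created above and should reproduce the Multiple R-squared of 0.08514 reported earlier, up to rounding:

# fitted value for each mammal is the mean awake time of its diet group
fitted_means <- ave(msleep_nona$awake, msleep_nona$vore)

# residuals: observed awake time minus the fitted group mean
res <- msleep_nona$awake - fitted_means

# multiple R-squared: one minus the residual sum of squares over the
# total sum of squares around the overall mean
1 - sum(res^2) / sum((msleep_nona$awake - mean(msleep_nona$awake))^2)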