Let’s look at a historical dataset of the heights of children relative to the heights of their parents. The data were published by Francis Galton in the 1880s. All heights are given in inches. Note that the children are all fully grown adults.

library(readr)  # provides read_csv(); not needed if the tidyverse is already loaded
heights <- read_csv("../data/galton_heights.csv")
heights
## # A tibble: 898 × 6
##    family father mother gender height  kids
##    <chr>   <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1 1        78.5   67   M        73.2     4
##  2 1        78.5   67   F        69.2     4
##  3 1        78.5   67   F        69       4
##  4 1        78.5   67   F        69       4
##  5 2        75.5   66.5 M        73.5     4
##  6 2        75.5   66.5 M        72.5     4
##  7 2        75.5   66.5 F        65.5     4
##  8 2        75.5   66.5 F        65.5     4
##  9 3        75     64   M        71       2
## 10 3        75     64   F        68       2
## # ℹ 888 more rows

We can use a linear regression to study the height of children as a function of their mother’s height:

model <- lm(height ~ mother, data = heights)
summary(model)
## 
## Call:
## lm(formula = height ~ mother, data = heights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5474 -2.6346 -0.1079  2.8688 11.9526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.69077    3.25874  14.328  < 2e-16 ***
## mother       0.31318    0.05082   6.163 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.511 on 896 degrees of freedom
## Multiple R-squared:  0.04066,    Adjusted R-squared:  0.03959 
## F-statistic: 37.98 on 1 and 896 DF,  p-value: 1.079e-09
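
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary output above; the mother heights of 60, 64, and 68 inches are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary output of lm(height ~ mother)
intercept <- 46.69077
slope_mother <- 0.31318

# Hypothetical mother heights in inches (illustration only)
mother_heights <- c(60, 64, 68)

# Predicted child height: intercept + slope * mother's height
predicted <- intercept + slope_mother * mother_heights

data.frame(mother = mother_heights, predicted_height = round(predicted, 2))
```

With the fitted object available, `predict(model, newdata = data.frame(mother = c(60, 64, 68)))` should give the same values, up to rounding of the copied coefficients. Each extra inch of mother's height raises the predicted height by about 0.31 inches.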

Using the new multivariate approach, we can include both the mother’s height and the father’s height as predictors:

model <- lm(height ~ mother + father, data = heights)
summary(model)
## 
## Call:
## lm(formula = height ~ mother + father, data = heights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.136 -2.700 -0.181  2.768 11.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.30971    4.30690   5.180 2.74e-07 ***
## mother       0.28321    0.04914   5.764 1.13e-08 ***
## father       0.37990    0.04589   8.278 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.386 on 895 degrees of freedom
## Multiple R-squared:  0.1089, Adjusted R-squared:  0.1069 
## F-statistic: 54.69 on 2 and 895 DF,  p-value: < 2.2e-16
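
With two predictors, a prediction is the intercept plus each coefficient times its own variable. The sketch below checks this against `predict()` on simulated data, since it does not assume access to the Galton file. The data frame `sim`, the generating coefficients, and the family with a 64-inch mother and 69-inch father are all made up for illustration.

```r
set.seed(1)
n <- 200

# Simulated parents' heights (NOT the Galton data)
sim <- data.frame(
  mother = rnorm(n, mean = 64, sd = 2.3),
  father = rnorm(n, mean = 69, sd = 2.5)
)
# Simulated child height with arbitrary illustrative coefficients plus noise
sim$height <- 22 + 0.3 * sim$mother + 0.4 * sim$father + rnorm(n, sd = 3)

sim_model <- lm(height ~ mother + father, data = sim)

# Prediction for one hypothetical family, using predict()
new_family <- data.frame(mother = 64, father = 69)
by_predict <- predict(sim_model, newdata = new_family)

# The same prediction assembled by hand from the coefficients
b <- coef(sim_model)
by_hand <- b["(Intercept)"] + b["mother"] * 64 + b["father"] * 69

all.equal(unname(by_predict), unname(by_hand))
```

Each coefficient is read as the expected change in height for a one-inch change in that parent's height while the other parent's height is held fixed.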

We can also add variables that correspond to categories. For example, we can add a marker for reported gender to the regression as follows:

model <- lm(height ~ mother + father + gender, data = heights)
summary(model)
## 
## Call:
## lm(formula = height ~ mother + father + gender, data = heights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.523 -1.440  0.117  1.473  9.114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.34476    2.74696   5.586 3.08e-08 ***
## mother       0.32150    0.03128  10.277  < 2e-16 ***
## father       0.40598    0.02921  13.900  < 2e-16 ***
## genderM      5.22595    0.14401  36.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.154 on 894 degrees of freedom
## Multiple R-squared:  0.6397, Adjusted R-squared:  0.6385 
## F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16

Notice that the output adds a coefficient called genderM. It corresponds to an indicator variable equal to 0 for children labeled as female and 1 for children labeled as male (by default, R uses the alphabetically first category as the baseline). The estimate, here about 5.23, gives the expected extra height in inches of male children relative to female children whose parents have the same heights.
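
The 0/1 coding can be inspected directly with `model.matrix()`, which builds the matrix of predictors that `lm()` uses internally. The sketch below uses a tiny hypothetical data frame `toy` rather than the Galton data, and also shows one way to pick a different baseline with `relevel()`.

```r
# Toy data frame (hypothetical values, not taken from the Galton data)
toy <- data.frame(gender = c("M", "F", "F", "M"))

# Default coding: "F" comes first alphabetically, so it is the baseline
# and the indicator column is named genderM
default_design <- model.matrix(~ gender, data = toy)
default_design

# Make "M" the baseline instead by converting to a factor and releveling;
# the indicator column now marks the "F" rows
toy$gender_m_base <- relevel(factor(toy$gender), ref = "M")
releveled_design <- model.matrix(~ gender_m_base, data = toy)
releveled_design
```

Changing the baseline does not change the fitted values; it only flips the sign of the gender coefficient and shifts the intercept accordingly.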

As a final measurement, the R-squared value describes how much of the variation in the Ys is explained by the model. It ranges from zero (no explanation) to one (fully explaining the variable). It is defined as one minus the ratio of the variance of the residuals (Y-Xb) to the variance of Y. Go back and see how much better our final model is compared to the first attempts at explaining the height of the people in the dataset: the R-squared rises from about 0.04 with the mother’s height alone, to about 0.11 with both parents, to about 0.64 once gender is included.
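
The R-squared reported by `summary()` can be reproduced as one minus the variance of the residuals divided by the variance of the response. The sketch below does this on simulated data; the data frame `sim` and its generating values are made up for illustration and are not the Galton data.

```r
set.seed(42)
n <- 300

# Simulated mother and child heights (NOT the Galton data)
sim <- data.frame(mother = rnorm(n, mean = 64, sd = 2.3))
sim$height <- 47 + 0.3 * sim$mother + rnorm(n, sd = 3.5)

sim_model <- lm(height ~ mother, data = sim)

# R-squared by hand: 1 - Var(residuals) / Var(Y)
r2_by_hand <- 1 - var(residuals(sim_model)) / var(sim$height)

# R-squared as reported by summary()
r2_reported <- summary(sim_model)$r.squared

all.equal(r2_by_hand, r2_reported)
```

The two values agree because the model includes an intercept, so the residuals average to zero and the ratio of variances equals the ratio of the residual sum of squares to the total sum of squares.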