Let’s look at a historical dataset of the heights of children relative to the heights of their parents. It comes from data published by Francis Galton in the 1880s. All heights are given in inches. Note that these children are all fully grown adults.
## # A tibble: 898 × 6
##    family father mother gender height  kids
##    <chr>   <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1 1        78.5   67   M        73.2     4
##  2 1        78.5   67   F        69.2     4
##  3 1        78.5   67   F        69       4
##  4 1        78.5   67   F        69       4
##  5 2        75.5   66.5 M        73.5     4
##  6 2        75.5   66.5 M        72.5     4
##  7 2        75.5   66.5 F        65.5     4
##  8 2        75.5   66.5 F        65.5     4
##  9 3        75     64   M        71       2
## 10 3        75     64   F        68       2
## # ℹ 888 more rows
We can use a linear regression to study the height of children as a function of their mother’s height:
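In R, a fit like this comes from a call to `lm()`; here is a minimal sketch, assuming the data shown above live in a tibble named `heights` (as the `Call` line in the output below confirms):

```r
# Regress child height on mother's height
fit_mother <- lm(height ~ mother, data = heights)
summary(fit_mother)
```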
##
## Call:
## lm(formula = height ~ mother, data = heights)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.5474 -2.6346 -0.1079  2.8688 11.9526
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.69077    3.25874  14.328  < 2e-16 ***
## mother       0.31318    0.05082   6.163 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.511 on 896 degrees of freedom
## Multiple R-squared:  0.04066, Adjusted R-squared:  0.03959
## F-statistic: 37.98 on 1 and 896 DF,  p-value: 1.079e-09
Using multiple regression (a regression with more than one predictor), we can model the children's heights as a function of both their mother's and their father's heights:
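A sketch of the corresponding call, again assuming the `heights` tibble from above; the second predictor is simply added to the formula:

```r
# Regress child height on both parents' heights
fit_parents <- lm(height ~ mother + father, data = heights)
summary(fit_parents)
```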
##
## Call:
## lm(formula = height ~ mother + father, data = heights)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.136 -2.700 -0.181  2.768 11.689
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.30971    4.30690   5.180 2.74e-07 ***
## mother       0.28321    0.04914   5.764 1.13e-08 ***
## father       0.37990    0.04589   8.278 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.386 on 895 degrees of freedom
## Multiple R-squared:  0.1089, Adjusted R-squared:  0.1069
## F-statistic: 54.69 on 2 and 895 DF,  p-value: < 2.2e-16
We can also add variables that correspond to categories. For example, we can add an indicator for reported gender to the regression as follows:
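A sketch of the call, assuming the same `heights` tibble; `lm()` converts the character column `gender` into a 0/1 indicator automatically:

```r
# Add the categorical gender variable alongside the numeric predictors
fit_full <- lm(height ~ mother + father + gender, data = heights)
summary(fit_full)
```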
##
## Call:
## lm(formula = height ~ mother + father + gender, data = heights)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.523 -1.440  0.117  1.473  9.114
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.34476    2.74696   5.586 3.08e-08 ***
## mother       0.32150    0.03128  10.277  < 2e-16 ***
## father       0.40598    0.02921  13.900  < 2e-16 ***
## genderM      5.22595    0.14401  36.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.154 on 894 degrees of freedom
## Multiple R-squared:  0.6397, Adjusted R-squared:  0.6385
## F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16
Notice that the output adds a coefficient called `genderM`, a variable equal to 0 for children labeled as female and 1 for children labeled as male (by default, the baseline is the alphabetically first category). The estimate, here about 5.23, gives the expected extra height in inches of male children relative to female children whose parents have the same heights.
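To see this 0/1 coding directly, you can inspect the model matrix that `lm()` builds; a quick sketch, assuming the same `heights` tibble:

```r
# The genderM column is 0 on "F" rows and 1 on "M" rows; "F" is the
# baseline because factor levels default to alphabetical order
head(model.matrix(height ~ mother + father + gender, data = heights))
```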
As a final measure, the R-squared value quantifies how much of the variation in Y is explained by the model. It ranges from zero (no explanation) to one (the model explains the variable fully). It is defined as one minus the ratio of the variance of the residuals (Y - Xb) to the variance of Y. Go back and compare: our final model explains about 64% of the variance in height, versus about 4% for the first attempt at explaining the heights of the people in the dataset.
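As a sanity check, that definition can be computed by hand; a minimal sketch, assuming the `heights` tibble from above:

```r
# R-squared as one minus the ratio of residual variance to the variance of Y
fit <- lm(height ~ mother + father + gender, data = heights)
1 - var(residuals(fit)) / var(heights$height)
# Matches summary(fit)$r.squared (about 0.64); the ratio of variances equals
# the usual sum-of-squares ratio because the residuals of a model with an
# intercept have mean zero
```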