We do not focus much in this class on models. Those are covered fairly extensively in many of the other courses in our data science program. In these notes we will focus on a very specific part of modeling, namely how to create derived data tables that describe the output of a model in a way that respects the 3NF format we saw in the previous notes. As an example, we will use a linear regression. If you have never come across this model, don’t fear. The notes will introduce all of the details that you need.
Let’s start by drawing a scatter plot with total fat on the x-axis and the number of calories on the y-axis using our standard food data.
food %>%
  ggplot(aes(total_fat, calories)) +
  geom_point()
We see that generally these two variables are positively related to one another. As total fat increases so do the calories. We can try to formalize this idea by fitting a model. One of the most popular models for this type of relationship is a linear regression, which assumes that we can relate these two variables according to the following (the subscript i indexes each row of the data):
\[ \text{calories}_i = A + B \cdot \text{total_fat}_i + \text{error}_i \]
In words, we expect the caloric content of each food to be close to
the constant A
(intercept) plus the total fat times the
constant B
(slope). This will not be a perfect fit, so there is an error term that captures the difference between the simple model and each data point. Visually, we want something like this:
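Before estimating anything, the equation itself can be made concrete in code. In the sketch below, the values of A and B are made-up placeholders rather than estimates, and we assume the food table and the tidyverse are loaded as in the rest of these notes:

```r
# Placeholder guesses for the intercept and slope (illustration only)
A <- 60
B <- 12

food %>%
  mutate(
    fitted = A + B * total_fat,   # the model's guess for the calories
    error  = calories - fitted    # what the error term must absorb
  ) %>%
  select(item, total_fat, calories, fitted, error)
```

Fitting a model amounts to choosing the A and B that make these errors collectively as small as possible.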
Our goal when building a model is to use the data to figure out the
best slope and intercept to describe the data that we have. For linear
regression, this is done in R using the function lm. Here is the syntax and basic output:
model <- lm(calories ~ total_fat, data = food)
model
##
## Call:
## lm(formula = calories ~ total_fat, data = food)
##
## Coefficients:
## (Intercept) total_fat
## 61.73 11.67
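As a quick aside, the estimates can also be pulled out programmatically with base R’s coef function, which returns a named numeric vector (we will see a tidier approach with the broom package below):

```r
# coef() returns the fitted coefficients as a named vector
coef(model)                  # (Intercept) about 61.73, total_fat about 11.67

# individual values can be extracted by name
coef(model)["total_fat"]
```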
Here we have the model’s estimate of the slope and intercept printed
to the screen. We can get a lot more information about the model using the function summary:
summary(model)
##
## Call:
## lm(formula = calories ~ total_fat, data = food)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.578 -31.233 -6.566 23.099 257.264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.7325 7.5978 8.125 3.36e-11 ***
## total_fat 11.6672 0.9545 12.223 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.26 on 59 degrees of freedom
## Multiple R-squared: 0.7169, Adjusted R-squared: 0.7121
## F-statistic: 149.4 on 1 and 59 DF, p-value: < 2.2e-16
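The numbers in this printout are also available programmatically from the summary object through standard base-R accessors:

```r
s <- summary(model)
s$r.squared       # Multiple R-squared, about 0.717
s$sigma           # residual standard error, about 49.26
s$coefficients    # matrix of estimates, std. errors, t values, and p-values
```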
There is a lot of information here, and we are not going to describe all of the probabilistic details of linear regression in these short notes. Hopefully you have taken, or will be able to take, a course that focuses on the theory of statistical models.
Our goal in these notes is to organize and structure the output of the model as data. The print out from the summary function above is nice to read but does not give us a way of working with the model output in a programmatic way. How might we structure the model data using table formats we have learned?
Linear regression, and in fact most statistical models, require three different tables to capture all of the information we generally want access to in a normalized format. We can describe these by their different units of observation:

1. A table about the model as a whole, with a single row containing model-level summaries such as the R-squared.
2. A table about the coefficients, with one observation for the intercept A and one for the slope B. These observations capture the best guess of each parameter and also often contain information about how certain we are about the guess based on statistical assumptions and theory.
3. A table about the original observations, with one row for each row of the input data, containing quantities such as the fitted value and residual of each point.

There is a very helpful R package called broom that will produce these three tables for us when given a model object. Here we’ll show it with a simple linear regression, but it works the same way with many different kinds of models. Note that it names new columns using a period where our class convention would usually require an underscore.
To get a table about the entire model, use the function glance:
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.717 0.712 49.3 149. 8.25e-18 1 -323. 653. 659. 143181.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
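Because glance returns an ordinary one-row tibble, the model-level statistics can flow straight into a pipe like any other table. For example, to keep just a few of the columns:

```r
model %>%
  glance() %>%
  select(r.squared, sigma, nobs)   # fit quality, residual sd, sample size
```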
To get a table about the coefficients, use the function tidy:
tidy(model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 61.7 7.60 8.13 3.36e-11
## 2 total_fat 11.7 0.955 12.2 8.25e-18
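Since tidy also returns a normal tibble with one row per coefficient, we can manipulate it with the table verbs we already know. For example, keeping only the slope row:

```r
tidy(model) %>%
  filter(term == "total_fat") %>%        # drop the intercept row
  select(term, estimate, std.error)
```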
And to get the table about the original observations, use the function augment. Here we set the parameter newdata to the table we want to augment (the new columns come at the end, so I use select to see them).
augment(model, newdata = food) %>%
select(item, total_fat, calories, .fitted, .resid)
## # A tibble: 61 × 5
## item total_fat calories .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apple 0.1 52 62.9 -10.9
## 2 Asparagus 0.1 20 62.9 -42.9
## 3 Avocado 14.6 160 232. -72.1
## 4 Banana 0.3 89 65.2 23.8
## 5 Chickpea 2.9 180 95.6 84.4
## 6 String Bean 0.1 31 62.9 -31.9
## 7 Beef 19.5 288 289. -1.24
## 8 Bell Pepper 0 26 61.7 -35.7
## 9 Crab 1 87 73.4 13.6
## 10 Broccoli 0.3 34 65.2 -31.2
## # … with 51 more rows
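One common use of the augmented table is to find the foods the model fits worst. Sorting by the absolute value of the residual does this:

```r
augment(model, newdata = food) %>%
  mutate(abs_resid = abs(.resid)) %>%     # size of each miss
  arrange(desc(abs_resid)) %>%            # worst fits first
  select(item, total_fat, calories, .fitted, .resid) %>%
  slice(1:5)                              # the five worst-fit foods
```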
We can use this augmented data in a pipe to produce a version of the plot I showed above:
model %>%
  augment(newdata = food) %>%
  ggplot(aes(total_fat, calories)) +
  geom_point(color = "grey85") +
  geom_line(aes(y = .fitted), color = "red", linetype = "dashed")
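The newdata parameter does not have to be the original table; any table with a total_fat column will work, which makes it easy to compute predictions for hypothetical foods. The fat values below are invented for illustration:

```r
# hypothetical fat values (grams); .fitted gives the predicted calories
new_foods <- tibble(total_fat = c(0, 5, 10, 20))
augment(model, newdata = new_foods)
```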
We won’t spend much more time talking about the details of the models, but hopefully this helps connect some of the ideas in this class with things you may be doing in other statistics courses; it will also be very useful in DSST389.
Consider the following plot, which shows a linear model applied to a subset of the hans dataset to predict life expectancy as a function of GDP.
Answer the following three questions:

1. Consider the output of tidy(model) applied to the model shown above. Write down an approximation of what the first two columns (term and estimate) would be for the model shown by the red dashed line above.
2. Consider the output of augment(model, newdata = .) applied to the model and data shown above. Write down an approximation of the columns country, .fitted, and .resid for at least three of the countries.
3. From the plot, which country appears to have the smallest absolute residual, and which appears to have the largest?

For 1, the actual values for the first two columns of the tidy function are given by the following (you don’t need to have the exact values; I would say from the plot the intercept is a bit larger than 70 and the slope is around 3 / 10000):
## # A tibble: 2 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 70.3
## 2 gdp 0.000298
For 2, here is the actual output of the augmented model:
## # A tibble: 10 × 5
## country gdp life_exp .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Venezuela 11416. 73.7 73.7 0.0434
## 2 Canada 36319. 80.7 81.1 -0.468
## 3 Costa Rica 9645. 78.8 73.2 5.61
## 4 Argentina 12779. 75.3 74.1 1.21
## 5 Bolivia 3822. 65.6 71.4 -5.89
## 6 El Salvador 5728. 71.9 72.0 -0.132
## 7 Honduras 3548. 70.2 71.4 -1.16
## 8 Panama 9809. 75.5 73.2 2.31
## 9 Trinidad and Tobago 18009. 69.8 75.7 -5.85
## 10 Chile 13172. 78.6 74.2 4.33
And the smallest and largest absolute residuals are given by:
## # A tibble: 1 × 2
## smallest biggest
## <chr> <chr>
## 1 Venezuela Bolivia
It’s hard to tell on the plot, though, and you may have guessed El Salvador and/or Trinidad and Tobago, respectively.
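For reference, this answer can be computed directly rather than read off the plot. Assuming the exercise’s fitted model and the ten-country table are stored as model and hans_subset (hypothetical names), one sketch is:

```r
augment(model, newdata = hans_subset) %>%
  mutate(abs_resid = abs(.resid)) %>%
  summarize(
    smallest = country[which.min(abs_resid)],   # best-fit country
    biggest  = country[which.max(abs_resid)]    # worst-fit country
  )
```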