We do not focus much in this class on models. Those are covered fairly extensively in many of the other courses in our data science program. In these notes we will focus on a very specific part of modeling, namely how to create derived data tables that describe the output of a model in a way that respects the 3NF format we saw in the previous notes. As an example, we will use a linear regression. If you have never come across this model, don’t fear. The notes will introduce all of the details that you need.
Let’s start by drawing a scatter plot with total fat on the x-axis and the number of calories on the y-axis using our standard food data.
food %>%
  ggplot(aes(total_fat, calories)) +
  geom_point()
We see that generally these two variables are positively related to one another. As total fat increases so do the calories. We can try to formalize this idea by fitting a model. One of the most popular models for this type of relationship is a linear regression, which assumes that we can relate these two variables according to the following (the subscript i indexes each row of the data):
\[ \text{calories}_i = A + B \cdot \text{total_fat}_i + \text{error}_i \]
In words, we expect the caloric content of each food to be close to
the constant A
(intercept) plus the total fat times the
constant B
(slope). This will not be a perfect fit, so there is an error term that captures the difference between the simple model and each data point. Visually, we want something like this:
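Before estimating anything, the equation itself can be made concrete in code. In the sketch below, the values of A and B are made-up placeholders rather than estimates, and we assume the food table and the tidyverse are loaded as in the rest of these notes:

```r
# Placeholder guesses for the intercept and slope (illustration only)
A <- 60
B <- 12

food %>%
  mutate(
    fitted = A + B * total_fat,   # the model's guess for the calories
    error  = calories - fitted    # what the error term must absorb
  ) %>%
  select(item, total_fat, calories, fitted, error)
```

Fitting a model amounts to choosing the A and B that make these errors collectively as small as possible.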
Our goal when building a model is to use the data to figure out the
best slope and intercept to describe the data that we have. For linear
regression, this is done in R using the function lm. Here is the syntax and basic output:
model <- lm(calories ~ total_fat, data = food)
model
##
## Call:
## lm(formula = calories ~ total_fat, data = food)
##
## Coefficients:
## (Intercept) total_fat
## 61.73 11.67
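As a quick aside, the estimates can also be pulled out programmatically with base R’s coef function, which returns a named numeric vector (we will see a tidier approach with the broom package below):

```r
# coef() returns the fitted coefficients as a named vector
coef(model)                  # (Intercept) about 61.73, total_fat about 11.67

# individual values can be extracted by name
coef(model)["total_fat"]
```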
Here we have the model’s estimate of the slope and intercept printed
to the screen. We can get a lot more information about the model using the function summary:
summary(model)
##
## Call:
## lm(formula = calories ~ total_fat, data = food)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.578 -31.233 -6.566 23.099 257.264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.7325 7.5978 8.125 3.36e-11 ***
## total_fat 11.6672 0.9545 12.223 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.26 on 59 degrees of freedom
## Multiple R-squared: 0.7169, Adjusted R-squared: 0.7121
## F-statistic: 149.4 on 1 and 59 DF, p-value: < 2.2e-16
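The numbers in this printout are also available programmatically from the summary object through standard base-R accessors:

```r
s <- summary(model)
s$r.squared       # Multiple R-squared, about 0.717
s$sigma           # residual standard error, about 49.26
s$coefficients    # matrix of estimates, std. errors, t values, and p-values
```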
There is a lot of information here, and we are not going to describe all of the probabilistic details of linear regression in these short notes. Hopefully you have taken, or will be able to take, a course that focuses on the theory of statistical models.
Our goal in these notes is to organize and structure the output of the model as data. The print out from the summary function above is nice to read but does not give us a way of working with the model output in a programmatic way. How might we structure the model data using table formats we have learned?
Linear regression, and in fact most statistical models, require three different tables to capture all of the information we generally want access to in a normalized format. We can describe these by their different units of observation:

1. A table about the model as a whole, with a single row containing model-level summaries such as the R-squared.
2. A table about the coefficients, with one observation for the intercept A and one for the slope B. These observations capture the best guess of each parameter and also often contain information about how certain we are about the guess based on statistical assumptions and theory.
3. A table about the original observations, with one row for each row of the input data, containing quantities such as the fitted value and residual of each point.

There is a very helpful R package called broom that will produce these three tables for us when given a model object. Here we’ll show it with a simple linear regression, but it works the same way with many different kinds of models. Note that it names new columns using a period where our class convention would usually require an underscore.
To get a table about the entire model, use the function glance:
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.717 0.712 49.3 149. 8.25e-18 1 -323. 653. 659. 143181.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
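Because glance returns an ordinary one-row tibble, the model-level statistics can flow straight into a pipe like any other table. For example, to keep just a few of the columns:

```r
model %>%
  glance() %>%
  select(r.squared, sigma, nobs)   # fit quality, residual sd, sample size
```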
To get a table about the coefficients, use the function tidy:
tidy(model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 61.7 7.60 8.13 3.36e-11
## 2 total_fat 11.7 0.955 12.2 8.25e-18
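Since tidy also returns a normal tibble with one row per coefficient, we can manipulate it with the table verbs we already know. For example, keeping only the slope row:

```r
tidy(model) %>%
  filter(term == "total_fat") %>%        # drop the intercept row
  select(term, estimate, std.error)
```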
And to get the table about the original observations, use the function augment. Here we set the parameter newdata to the table we want to augment (the new columns come at the end, so I use select to see them).
augment(model, newdata = food) %>%
select(item, total_fat, calories, .fitted, .resid)
## # A tibble: 61 × 5
## item total_fat calories .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apple 0.1 52 62.9 -10.9
## 2 Asparagus 0.1 20 62.9 -42.9
## 3 Avocado 14.6 160 232. -72.1
## 4 Banana 0.3 89 65.2 23.8
## 5 Chickpea 2.9 180 95.6 84.4
## 6 String Bean 0.1 31 62.9 -31.9
## 7 Beef 19.5 288 289. -1.24
## 8 Bell Pepper 0 26 61.7 -35.7
## 9 Crab 1 87 73.4 13.6
## 10 Broccoli 0.3 34 65.2 -31.2
## # … with 51 more rows
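One common use of the augmented table is to find the foods the model fits worst. Sorting by the absolute value of the residual does this:

```r
augment(model, newdata = food) %>%
  mutate(abs_resid = abs(.resid)) %>%     # size of each miss
  arrange(desc(abs_resid)) %>%            # worst fits first
  select(item, total_fat, calories, .fitted, .resid) %>%
  slice(1:5)                              # the five worst-fit foods
```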
We can use this augmented data in a pipe to produce a version of the plot I showed above:
model %>%
  augment(newdata = food) %>%
  ggplot(aes(total_fat, calories)) +
  geom_point(color = "grey85") +
  geom_line(aes(y = .fitted), color = "red", linetype = "dashed")
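The newdata parameter does not have to be the original table; any table with a total_fat column will work, which makes it easy to compute predictions for hypothetical foods. The fat values below are invented for illustration:

```r
# hypothetical fat values (grams); .fitted gives the predicted calories
new_foods <- tibble(total_fat = c(0, 5, 10, 20))
augment(model, newdata = new_foods)
```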
We won’t spend much more time talking about the details of the models, but hopefully this helps connect some of the ideas in this class with things you may be doing in other statistics courses; it will also be very useful in DSST389.
Consider the following plot, which shows a linear model applied to a subset of the hans dataset to predict life expectancy as a function of GDP.
Answer the following three questions:

1. Consider the output of tidy(model) applied to the model shown above. Write down an approximation of what the first two columns (term and estimate) would be for the model shown by the red dashed line above.
2. Consider the output of augment(model, newdata = .) applied to the model and data shown above. Write down an approximation of the columns country, .fitted, and .resid for at least three of the countries.
3. From the plot, which country appears to have the smallest absolute residual, and which appears to have the largest?

For 1, the actual values for the first two columns of the tidy function are given by the following (you don’t need to have the exact values; I would say from the plot the intercept is a bit larger than 70 and the slope is around 3 / 10000):
## # A tibble: 2 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 70.3
## 2 gdp 0.000298
For 2, here is the actual output of the augmented model:
## # A tibble: 10 × 5
## country gdp life_exp .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Venezuela 11416. 73.7 73.7 0.0434
## 2 Canada 36319. 80.7 81.1 -0.468
## 3 Costa Rica 9645. 78.8 73.2 5.61
## 4 Argentina 12779. 75.3 74.1 1.21
## 5 Bolivia 3822. 65.6 71.4 -5.89
## 6 El Salvador 5728. 71.9 72.0 -0.132
## 7 Honduras 3548. 70.2 71.4 -1.16
## 8 Panama 9809. 75.5 73.2 2.31
## 9 Trinidad and Tobago 18009. 69.8 75.7 -5.85
## 10 Chile 13172. 78.6 74.2 4.33
And the smallest and largest absolute residuals are given by:
## # A tibble: 1 × 2
## smallest biggest
## <chr> <chr>
## 1 Venezuela Bolivia
It’s hard to tell on the plot, though, and you may have guessed El Salvador and/or Trinidad and Tobago, respectively.
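For reference, this answer can be computed directly rather than read off the plot. Assuming the exercise’s fitted model and the ten-country table are stored as model and hans_subset (hypothetical names), one sketch is:

```r
augment(model, newdata = hans_subset) %>%
  mutate(abs_resid = abs(.resid)) %>%
  summarize(
    smallest = country[which.min(abs_resid)],   # best-fit country
    biggest  = country[which.max(abs_resid)]    # worst-fit country
  )
```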