Class 09: Extending Multivariate Models

2017-09-26

library(readr)
library(ggplot2)
library(dplyr)
library(viridis)

Multivariate regression with categorical data

Cars Dataset

Today we are, once again, going to look at another classic dataset in statistics, this one featuring data about a number of automobiles.

cars <- read_csv("https://statsmaths.github.io/stat_data/mpg.csv")

Our goal today is to estimate the city fuel efficiency of each car.

Categorical predictors

It would be reasonable to start with a regression model that uses displ to predict the response variable. We can just as easily add categorical data to our model. Next week we will cover the specifics of what R does internally here, but for now let's just see what adding the class variable to the model does to the output:

model <- lm(cty ~ displ + class,
            data = cars)
cars$cty_pred <- predict(model, newdata = cars)
model
## 
## Call:
## lm(formula = cty ~ displ + class, data = cars)
## 
## Coefficients:
##     (Intercept)            displ     classcompact     classmidsize  
##          28.777           -2.172           -3.599           -3.676  
##    classminivan      classpickup  classsubcompact         classsuv  
##          -5.595           -6.182           -2.629           -5.599

Notice that we now have a separate term for each class of car. If you look more carefully, you'll see that there is no mention of “2seater” in the list. This level is excluded because otherwise we would have perfect collinearity between the variables (a violation of the model assumptions).
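
As a quick peek ahead (this is not part of the original output), you can inspect the dummy variables R builds behind the scenes with model.matrix; note that there is no column for 2seater:

head(model.matrix(cty ~ displ + class, data = cars))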

The model created here can be thought of as a set of parallel lines, one for each class of car. We can see that here:

ggplot(cars, aes(displ, cty_pred)) +
  geom_line(aes(color = class)) +
  geom_point(aes(color = class))

(Figure: cty_pred plotted against displ, colored by class; the fitted values form parallel lines, one for each class.)

Notice, for example, that compact and midsize have very close estimates in the regression model and very close lines on the plot.

Here, we have a different intercept (offset) for each class but the same slope. It is possible, easy in fact, to instead have different slopes and the same intercept: we simply use the : sign instead of the + sign in the formula specification.

model <- lm(cty ~ displ:class,
            data = cars)
cars$cty_pred <- predict(model, newdata = cars)
model
## 
## Call:
## lm(formula = cty ~ displ:class, data = cars)
## 
## Coefficients:
##           (Intercept)     displ:class2seater     displ:classcompact  
##                25.316                 -1.602                 -2.311  
##    displ:classmidsize     displ:classminivan      displ:classpickup  
##                -2.208                 -2.779                 -2.755  
## displ:classsubcompact         displ:classsuv  
##                -2.026                 -2.601

Here, each class gets its own slope while sharing a common intercept. Notice that, unlike in the previous model, no class is dropped: the coefficients are the class-specific slopes themselves rather than differences from a baseline.

ggplot(cars, aes(displ, cty_pred)) +
  geom_line(aes(color = class)) +
  geom_point(aes(color = class))

(Figure: cty_pred plotted against displ, colored by class; the fitted lines share an intercept but each class has its own slope.)

Finally, we could use a * in place of the : to have different slopes and intercepts.
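
For completeness, here is a sketch of that model (it is not fit in the output above):

model <- lm(cty ~ displ * class, data = cars)  # separate slope AND intercept per class
cars$cty_pred <- predict(model, newdata = cars)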

Extensions to the linear model

Generalized linear models

There are many other functions in R that have similar calling mechanisms to lm but run different underlying models.

For example, glm fits generalized linear models. These can be used, amongst other things, for fitting a model to a binary response. The logit model assumes the following relationship between the response y and the inputs x:

$$ \text{Pr}(y = 1 \mid x) = \frac{e^{\alpha + x \beta}}{1 + e^{\alpha + x \beta}} $$

Which is often simplified using the logit function itself, where logit(p) = log(p / (1 - p)):

$$ \text{logit}\left( \text{Pr}(y = 1 \mid x) \right) = \alpha + x \beta $$

Fitting the model is easy; we just use glm and specify the binomial model:

df <- data_frame(y = c(0,0,0,0,1,1,1,1), x = rnorm(8))
model <- glm(y ~ x, data = df, family = binomial())
summary(model)
## 
## Call:
## glm(formula = y ~ x, family = binomial(), data = df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.23107  -1.10197  -0.03086   0.95789   1.68198  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.2204     0.8021  -0.275    0.784
## x            -0.4092     0.5599  -0.731    0.465
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.090  on 7  degrees of freedom
## Residual deviance: 10.505  on 6  degrees of freedom
## AIC: 14.505
## 
## Number of Fisher Scoring iterations: 4

It is much harder, perhaps impossible, to fully interpret the coefficients directly (though we can still make sense of whether a coefficient is negative, positive, or zero). To get predictions, which we will describe as the probability that y is equal to 1, we need to apply the inverse logit function:

$$ \text{logit}^{-1}(z) = \frac{e^{z}}{1 + e^{z}} $$

Be careful, because by default predict returns the linear combination of the inputs without applying the inverse logit transformation. Notice here that the range is not constrained to lie between zero and one:

df$pred <- predict(model, newdata = df)
quantile(df$pred)
##          0%         25%         50%         75%        100% 
## -1.13607798 -0.28619477  0.03199053  0.51702909  0.64590315

To get the predicted probabilities, we need to add an option to the predict function:

df$pred <- predict(model, newdata = df, type = "response")
quantile(df$pred)
##        0%       25%       50%       75%      100% 
## 0.2430412 0.4291953 0.5079796 0.6264398 0.6560867

Of course, we could also apply the inverse logit function ourselves, but that is more error-prone and the transformation needs to change depending on the “family” chosen for the model.
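
As a minimal sketch of the manual route, assuming the default logit link used above (the names eta and pred_manual are just for illustration):

eta <- predict(model, newdata = df)          # linear predictor
df$pred_manual <- exp(eta) / (1 + exp(eta))  # inverse logit; same as plogis(eta)
all.equal(df$pred_manual, df$pred)           # should match type = "response"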

Robust linear models

In the MASS package (included with all standard R installations) is the rlm function for fitting robust linear regression:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
x <- rnorm(100)
y <- 1 + x + rnorm(100, sd = 0.2)  # true intercept 1, true slope 1
y[50] <- 1e4                       # contaminate one observation with a huge outlier
model_lm <- lm(y ~ x)
model_rlm <- rlm(y ~ x)

We see that the robust version is much more accurate than the standard regression function in this case:

summary(model_lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -276.9 -129.8  -92.3  -63.4 9868.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    91.59     101.78   0.900    0.370
## x             -56.70     104.35  -0.543    0.588
## 
## Residual standard error: 1003 on 98 degrees of freedom
## Multiple R-squared:  0.003004,	Adjusted R-squared:  -0.007169 
## F-statistic: 0.2953 on 1 and 98 DF,  p-value: 0.5881
summary(model_rlm)
## 
## Call: rlm(formula = y ~ x)
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.254e-01 -1.331e-01 -5.613e-03  1.491e-01  1.000e+04 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept)  1.0159  0.0222    45.8186
## x            0.9812  0.0227    43.1636
## 
## Residual standard error: 0.2161 on 98 degrees of freedom
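
To see the difference at a glance, we can also compare just the fitted coefficients (a small sketch, not part of the original output):

coef(model_lm)   # badly distorted by the single outlier
coef(model_rlm)  # close to the true values of 1 and 1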

Other models

If you need a specific model, you can usually find an R package that supports it. In most cases, fitting the model will roughly resemble calling lm; a sketch of one such package follows the list below.

Some common examples you may run into:

  • gam::gam for generalized additive models
  • nls for non-linear regression
  • lme4::lmer for mixed effects models
  • quantreg::rq for quantile regression
  • glmnet::glmnet for the generalized elastic net
  • randomForest::randomForest for random forest classification
  • forecast::auto.arima for modeling time series
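
For example, here is a minimal sketch of quantile regression with quantreg (assuming the package is installed; the object name model_rq is just for illustration):

library(quantreg)
model_rq <- rq(cty ~ displ, tau = 0.5, data = cars)  # median (tau = 0.5) regression
summary(model_rq)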

Thoughts on statistical inference

I am personally quite sceptical of p-values from large linear models. Here is an excellent demo showing how easy p-hacking is:

Hack Your Way To Scientific Glory

In general, for hypothesis tests to be valid, we need to decide on the exact model before seeing any of the data and then fit only that one model to the dataset. We can still fit more models and modify them as we observe results, but those results should be treated as exploratory only.