Class 09: Extending Multivariate Models
Multivariate regression with categorical data
Today we are, once again, going to look at another classic datasets in statistics featuring data about a number of automobiles.
Our goal today is to estimate the city fuel efficiency of each car.
It would be reasonable to start with a regression model
displ to predict the response variable.
We can just as easily add categorical data into our
model. Next week we will cover the specifics of what
is internally being done here, but for now let’s just
see what adding the
class variable to the model does
to the output:
Notice that it appears that we now have a separate term for each class of car. If you look more carefully you’ll see that there is no mention of “2seater” in the list. This value is excluded because otherwise we would have perfect collinearity between the variables (a violation of the model assumptions)
The model created here can be thought of as a set of parallel lines, one for each class of car. We can see that here:
Notice, for example, that compact and midsize have very close estimates in the regression model and very close lines on the plot.
Here, we have different offsets for each class but the same slope.
It is possible, easy in fact, to have different slopes and the same
intercept. We simply use the
: sign instead of the
+ sign in
the formula specification.
Here, the model gives the difference between each classes slope and the baseline slope.
Finally, we could use a
* in place of the
: to have different slopes
Extensions to the linear model
Generalized linear models
There are many other functions in R that have similar calling
lm but run different underlying models.
glm fits generalized linear models. These can
be used, amongst other things, for fitting a model to a binary
response. The logit model assumes the following relationship
between the response y and the inputs x:
Which is often simplified using the logit function itself:
Fitting the model is easy; we just use
glm and specify the
It is much harder, impossible perhaps, to fully understand the meaning of the coefficents (though we can make sense of the difference between it being negative, postive, and zero). To get predictions - which we will describe as the probability that y is equal to 1 - we would need to apply the inverse logit function:
Be careful, because the
predict by default returns the linear
combination of the inputs without the inverse logit transformation.
Notice here that the range is not between zero and one:
To get the predicted probabilities, we need to add an option to the predict function:
Of course, we could also apply the inverse logit function ourselves, but that is more error prone and needs to be changed depending on the “family” choosen for the logistic regression.
Robust linear models
In the MASS package (included with all standard R
installations) is the
rlm function for fitting robust
We see that the robust version is much more accurate than the standard regression function in this case:
If you have a need for a specific model, you can usually
find an R package that support it. In most cases, the model
will roughly resemble calling
Some common examples you may run into:
gam::gamfor generalized additive models
nlsfor non-linear regression
lme4::lmerfor mixed effects models
quantreg::qrfor quantila regression
glmnet::glmnetfor the generalized elastic net
randomforest::randomforestfor random forest classifications
forcast::auto.arimafor modeling time series
Thoughts on statistical inference
I am personally quite sceptical of p-values from large linear models. Here is an excellent demo showing how easy p-hacking is:
In general, in order to be valid hypothesis tests, we need to decide on the exact model before seeing any of the data and only run one model on the dataset. While we can run more models and modify them as we observe results, these can only be used for exploratory purposes.