# Class 09: Extending Multivariate Models

### 2017-09-26

## Multivariate regression with categorical data

### Cars Dataset

Today we are, once again, going to look at another classic datasets in statistics featuring data about a number of automobiles.

Our goal today is to estimate the city fuel efficiency of each car.

### Categorical predictors

It would be reasonable to start with a regression model
that uses `displ`

to predict the response variable.
We can just as easily add categorical data into our
model. Next week we will cover the specifics of what
is internally being done here, but for now let’s just
see what adding the `class`

variable to the model does
to the output:

Notice that it appears that we now have a separate term for each class of car. If you look more carefully you’ll see that there is no mention of “2seater” in the list. This value is excluded because otherwise we would have perfect collinearity between the variables (a violation of the model assumptions)

The model created here can be thought of as a set of parallel lines, one for each class of car. We can see that here:

Notice, for example, that compact and midsize have very close estimates in the regression model and very close lines on the plot.

Here, we have different offsets for each class but the same slope.
It is possible, easy in fact, to have different slopes and the same
intercept. We simply use the `:`

sign instead of the `+`

sign in
the formula specification.

Here, the model gives the difference between each classes slope and the baseline slope.

Finally, we could use a `*`

in place of the `:`

to have different slopes
and intercepts.

## Extensions to the linear model

### Generalized linear models

There are many other functions in R that have similar calling
mechanisms to `lm`

but run different underlying models.

For example, `glm`

fits generalized linear models. These can
be used, amongst other things, for fitting a model to a binary
response. The *logit* model assumes the following relationship
between the response y and the inputs x:

Which is often simplified using the logit function itself:

Fitting the model is easy; we just use `glm`

and specify the
binomial model:

It is much harder, impossible perhaps, to fully understand the meaning of the coefficents (though we can make sense of the difference between it being negative, postive, and zero). To get predictions - which we will describe as the probability that y is equal to 1 - we would need to apply the inverse logit function:

Be careful, because the `predict`

by default returns the linear
combination of the inputs without the inverse logit transformation.
Notice here that the range is not between zero and one:

To get the predicted probabilities, we need to add an option to the predict function:

Of course, we could also apply the inverse logit function ourselves, but that is more error prone and needs to be changed depending on the “family” choosen for the logistic regression.

### Robust linear models

In the **MASS** package (included with all standard R
installations) is the `rlm`

function for fitting robust
linear regression:

We see that the robust version is much more accurate than the standard regression function in this case:

### Other models

If you have a need for a specific model, you can usually
find an R package that support it. In most cases, the model
will roughly resemble calling `lm`

.

Some common examples you may run into:

`gam::gam`

for generalized additive models`nls`

for non-linear regression`lme4::lmer`

for mixed effects models`quantreg::qr`

for quantila regression`glmnet::glmnet`

for the generalized elastic net`randomforest::randomforest`

for random forest classifications`forcast::auto.arima`

for modeling time series

### Thoughts on statistical inference

I am personally quite sceptical of p-values from large linear models.
Here is an excellent demo showing how easy *p-hacking* is:

Hack Your Way To Scientific Glory

In general, in order to be valid hypothesis tests, we need to decide on the exact model before seeing any of the data and only run one model on the dataset. While we can run more models and modify them as we observe results, these can only be used for exploratory purposes.