# Class 04: Fuel Efficiency in the Big City

### 2017-09-07

## Cars Dataset

Today we are going to look at another classic datasets in statistics featuring data about a number of automobiles.

Our goal today is to estimate the city fuel efficency of each car. The variable is fairly concentrated around a few values without any extreme outliers:

The variable `displ`

is negatively related to the fuel efficency
of the car, as we can see in the following plot:

Within this plot, we can mark the classes of cars.

Do the car classes generally seem to be where you would expect them?

## Multivariate regression with categorical data

It would be reasonable to start with a regression model
that uses `displ`

to predict the response variable.
We can just as easily add categorical data into our
model. Next week we will cover the specifics of what
is internally being done here, but for now let’s just
see what adding the `class`

variable to the model does
to the output:

Notice that it appears that we now have a separate term for each class of car. If you look more carefully you’ll see that there is no mention of “2seater” in the list. This value is excluded because otherwise we would have perfect collinearity between the variables (a violation of the model assumptions)

The model created here can be thought of as a set of parallel lines, one for each class of car. We can see that here:

Notice, for example, that compact and midsize have very close estimates in the regression model and very close lines on the plot.

Here, we have different offsets for each class but the same slope.
It is possible, easy in fact, to have different slopes and the same
intercept. We simply use the `:`

sign instead of the `+`

sign in
the formula specification.

Here, the model gives the difference between each classes slope and the baseline slope.

Finally, we could use a `*`

in place of the `:`

to have different slopes
and intercepts.

## Validation Sets

One potential difficulty with our approach so far is that we are using the same data to validate models that we used to construct them. In this simple case it is unlikely to cause problems, but it will be a huge problem going forward as we look towards more complex prediction schemes.

So far, I have discussed that the observations are split into two
groups: those where we know the response and those where we need
to predict the responses. Looking at a table of the `train_id`

variable we see that their are actual three subsets of data:

The response is missing only on the 47 observations labels with
the test flag. What is the purpose of the validation label? When
building models, we should only train using the training set.
The validation set then can be used to *validate* how good the
model is working. We can use it to make decisions about which of
the various models we want to use for our final prediction.

In order to only using the training set when fitting a model with
the `lm`

function, we add an option called `subset`

:

Notice that the exact values for the linear model have changed slightly, though as mentioned previously in this simple case we would not expect the parameters to change much when removing the validation set.

Now, we will update our RMSE code to produce the RMSE on each training
id type; the testing variable will be missing (`NA`

) for you, but when
I run this code on the full data I will be able to see how well you
did on the hidden data. Don’t worry about the specifics of the function
`tapply`

here. I suggest just copying this code whenever you need it
and changing the response and dataset names as appropriate:

Notice here that the training dataset actually has a worse RMSE than the validation data. This is a common phenomenon with small and medium sized datasets but often strikes students as very confusing. Shouldn’t the model do better on data it was trained with compared to data it was not trained on?

The explanation is simply that, due to random chance, the validation data is noisier than the training data. So, naturally the RMSE of a simple model is lower on the validation set regardless of which chunk the model was fit on.

Now, let’s try to add the manufacturer of each car to the model:

The RMSE of this new model makes improvements on both the training and validation sets:

However, notice that the training set improved significantly
more than the validation set. We are starting to see the effects
of what is known as *overfitting*, where the model starts
predicting random noises in the training data that do not generalize
into the validation set. This will be a major theme throughout
the entire semester.

## Solving OLS

Let’s now return to calculating the solution to the linear regression model. I want to illustrate two general techniques that will help us understand the more complex models we’ll cover in the upcoming weeks.

Recall that we are assuming that data are distributed as:

We can define the OLS estimator by giving a name to the sum of squares:

Then, the OLS estimator is given by an optimization problem:

This is a very useful formulation because most ML algorithms are written as optimization problems. It will be helpful to compute the partial derivaties of f with respect to alpha and beta:

And

Putting these together, the gradient of f is just:

Of course, one way to solve this equation would be to set both derivatives equal to zero. The ability to do this is a privilege of the OLS model and will not generalize to other models. Instead, lets use the gradient and partial derivatives to iteratively solve the equation. There are two basic techniques for doing this, both useful in different situations.

## Coordinate descent

The idea of Coordinate descent is to cycle through the variables, optimizing our function univariately one variable at a time. Doing this many times seems like a reasonable approach to minimizing the function.

So, if we start with alpha, how would we update alpha given a fixed beta? We just set the partial derivative to zero:

Similarly, if we have a fixed alpha, to update beta we set the partial derivative to zero:

Let’s implement this in R, saving our results as we go:

The somewhat complex code at the end comes from my desire to
save each individual step as its own row in the dataset `df`

in order to plot exactly what is going on.

## Gradient descent

Coordinate descent forces us to iterate parallel to the axes of the data. It is like being constrained to moving along blocks in NYC rather than cutting across by more direct diagonals. Gradient descent modifies this approach by moving a small amount in the direction (or opposite direction, in the case of minimization) of the gradient of the function f. Specifically, we iteratively update according to:

Where rho is some fixed tuning parameter that determines how long each step is. It will be useful to write our algorithm for solving OLS via gradient descent as a function. Functions in R are blocks of code that we can automate with modified parameters. For example, the following adds the first input to twice the second input:

We’ll write a function to fit gradient descent:

Then, we will apply this function to your dataset with an initial guess at the value rho:

And, similar to the coordinate descent, we will plot the path of the algorithm:

Notice that the algorithm takes a few steps to get a good ratio between alpha and beta and then slowly marches towards the optimizer at the red dot. The final solution is still far away, however, and we need to go more iterations in order to be more precise:

Even 1000 steps has a noticeable error rate, let’s increase it to 10k:

Now, at least at this scale, the optimization task approachs the optimal value. In case you are wondering about why the path takes a seemingly circuitous route, we can plot the objective function to understand its overall shape:

Of course, one way of approaching the optimal value faster is to increase the learning rate rho. By setting it to 0.06, we visually converge around step 500:

Now, you may wonder why we do not just set the learning rate very high. Let’s try to set it to 0.08 (not too much higher than the 0.06 above). I’ll set the number of iterations to just 10 and label the actual steps on the plot:

Something strange is going on here; the larger learning rate is moving the points *away* from
the optimum. If we go out a few hundred iterations, the algorithm will actual run into numerical
overflow. The issue that with this new step size we are moving in the correct direction by taking
a step that is more than twice the distance to the minimum in the direction of the gradient.

Of course, there are fixes we can put into the iteration to fix this problem. For example, we could try several different step sizes at each point and only take the minimum (a simple version of a line search). We could also, in this case, use the second derivative to better estimate how large of a step to take. We’ll cover these variations and more throughout the next few weeks.