# Class 10: Local Models

### 2017-09-28

## Income data

For the first half of today, we are going to look at the dataset you used on your last lab.

Looking at the map, we see that income is highly local and highest near large cities (such as NYC, Chicago, LA, and Houston).

While we have seen ways of building models that account for linearites and interactions (GAMs and basis expansion), today’s focus will be on using inherantly local models to learn the response variable here.

### Tuning k-nearest neighbors

The knn model, as we saw last time, requires us to use a model matrix. We will construct one here using only longitude and latitude:

While we usually do not need a specific validation X matrix and response vector y, it will be useful here today. We will fit the k-nearest neighbors algorithm for 25 values of the hyperparameter k, specifically, the integers 1 through 25. I will save the RMSE of the validation set for each value of k.

With these values, we can plot the RMSE as a function of the parameter k:

The optimal value seems to be at k equal to 5, with a drastic degradation for k equal to 1 or 2.

One thing that may be counterintuitive about k-nearest neighbors is the relationship between k and the model complexity. Here is a plot of the fitted values for k equal to 3000 over the Northeastern US:

There are large regions that all have very similar predicted values. In contrast, here are the predicted value from k equal to 3.

Here, the values change rapidly and there are many nearby points with drastically different values. So, a high value of k is a less complex model and a low value is a more complex model. If this seems confusing, think about k being the size of the entire training set (every point will then be equal to a constant: the training set mean). Also perhaps confusing is that models with a larger k take longer to fit; this should make sense as there are more points to average, but goes against our general idea that more complex models are more difficult to fit.

## Clustering with K-means

An alternative to k-nearest neighbors is to split the dataset
into fixed neighborhoods of nearby points and use these as
indicator variables in a model. To do this, we need to cluster
the input data. There are many algorithms available for clustering;
one of the most widely known and implemented is k-means clustering.
This is achieved by way of the `kmeans`

function and requires only
that we set the number of centers. Here is a plot of kmeans applied
to the `acs`

data:

Notice that I have set the random seed as the output of the algorithm is stochastic. Here, longitude and latitude are on similar scales, but generally you should scale the data matrix X prior to clustering.

Let’s now build 20, 100, and 200 member clusters in the dataset:

We can use these in a `glmnet`

model to determine which clusters
should have their own offsets. I am going to use the function
`model.Matrix`

(note the captial M) from the **MatrixModels**
package for efficency purposes. We will discuss the details of
the function once we get to textual data:

And we will fit an elastic net model to the data:

Notice that the output resembles k-nearest neighbors:

There are important differences with the k-nearest neighbors method, however. The regions to group together have been determined adaptively. Notice that upstate New York and Maine have moslty be pushed into the same values. The Metro NYC area, in constrast, has many more regions.

## Decision Trees

We will now look at a new dataset of housing prices in Ames, Iowa. Unlike our other housing datasets though, the samples here are individual houses.

One way to build a predictive algorithm is to describe a decision
tree. The best way to understand what this is is to see an example.
Here, we use the `tree`

library (pre-installed in R) to predict
sales price:

This relatively old library needs a special plotting notation (so don’t worry if this looks strange)

In order to do prediction, we start at the top of the tree and
look at each logical statement. If True, move to the left and
if False move to right. At the bottom (the *terminal nodes* or *leaves*),
there are predicted values. The tree was built by greedily picking
variables to spit the dataset up by until some stopping criterion
was reached; this is usually a fixed depth, a fixed proportion of
the data, or a fixed decrease in the RMSE of accuracy rate.

## Random Forests

Decision trees give very noisy predictions due to their use of greedy logic and because points on the boundary of a decision cut-off are forced into a fixed bucket. By noisy, I mean that a slightly different training set would yield significantly different predictions for at least some of the test points. This may seem like a design flaw, but we can easily turn it into a design feature!

The idea of a random forest is to add some randomness into the
decision tree algorithm, fit a large number of trees using this
random variation, and produce predictions by averaging together
the predictions from all these individual trees (its a *forest*
because there are a lot of trees; get it?). The random logic
applies only the building of the trees; once created, each tree
is exactly the same as in the case above. The randomness comes
from two sources:

- for each tree, select only a subset of the training data to train with
- for each split, select only a subset of the available variables to split on

The exact values for these two random features can be set as
hyperparameters. We can fit random forests using the `randomForest`

function from the package with the same name as follows:

Here I selected 20 randomly generated trees, each having at most
3 terminal nodes and only allowing one variable to be used at
each split. These are very low settings, used only for illustrating
the algorithm here. We can get predictions from each individual
tree by setting `predict.all`

to `TRUE`

:

Here is the prediction for just the third tree:

Can you figure out roughly what the tree looks like? It first splits on overal quality being less than 7.5, and then splits the lower quality houses by year built around 1982. The individual prediction is not very smooth or complex.

Taking all of the twenty trees together, the average model looks quite a bit different:

Helpfully, the **randomForest** also provides the function
`importance`

that measures how important each variable is
to the model.

This is a measurement of how often the variable was used in the model and how much it decreased the RMSE each time it was used to split the dataset.

## Gradient Boosted Trees

Gradient boosted trees offer a slightly different approach to random forests for making use of the noisy nature of decision trees. Like random forests, they construct a number of trees, each using only a subset of the training data. They do not restrict the variables available for each node to split on. Most importantly, gradient boosted trees are fit in a sequence, with each tree trying to predict the residuals left over by the other trees.

More exactly, if the fitted values from the t-th tree are given by:

Then we train the k-th tree on the values Z given by:

The parameter eta is the learning rate. If set to one, this
is exactly fitting on the residuals of the prior trees.
Setting to less than one stop the trees from overfitting from
the first few trees. Here, we prepare a larger set of variables
from the `ames`

dataset:

We will use the **xgboost** package to fit gradient
boosted trees. I will set the eta parameter to 0.02.

And we can do prediction on the dataset:

Alternatively, we can use the function `xgb.DMatrix`

to
combine the data matrix and labels:

And use a more advanced calling method for **xgboost**:

The algorithm is the same, but there are more options available
with `xgb.train`

. As with random forests, there is a way of
looking at variable importance. I don’t like the default output
view, so here is some code to make it look nicer:

## Thoughts on local models

We’ve covered a lot today. Here are some take aways:

- k-nearest neighbors are great for smoothing predictions or creating meta-features
- k is inversely related to the complexity of the model
- using clusters simulated KNN but allows for covariates and adaptation to the data
- don’t use the
`tree`

function for actual predictions; I only used it to illustrate decision trees - random forests are easy to use and difficult to overfit with
- gradient boosted trees are incredibly powerful (often the give the most predictive models in large ML competitions)
- you need to tune eta and number of trees in GBT to get a good model

The last point should be the object of study for the next lab. We will discuss the details more next week when going over the lab.