For the first half of today, we are going to look at the dataset
you used on your last lab.
Looking at the map, we see that income is highly local and highest
near large cities (such as NYC, Chicago, LA, and Houston).
While we have seen ways of building models that account for
linearites and interactions (GAMs and basis expansion), today’s
focus will be on using inherantly local models to learn the
response variable here.
Tuning k-nearest neighbors
The knn model, as we saw last time, requires us to use
a model matrix. We will construct one here using only
longitude and latitude:
While we usually do not need a specific validation X matrix
and response vector y, it will be useful here today. We will
fit the k-nearest neighbors algorithm for 25 values of the
hyperparameter k, specifically, the integers 1 through 25.
I will save the RMSE of the validation set for each value of
With these values, we can plot the RMSE as a function of
the parameter k:
The optimal value seems to be at k equal to 5, with a drastic
degradation for k equal to 1 or 2.
One thing that may be counterintuitive about k-nearest neighbors
is the relationship between k and the model complexity. Here is
a plot of the fitted values for k equal to 3000 over the Northeastern
There are large regions that all have very similar predicted
values. In contrast, here are the predicted value from k equal to
Here, the values change rapidly and there are many nearby
points with drastically different values. So, a high value
of k is a less complex model and a low value is a more
complex model. If this seems confusing, think about k being
the size of the entire training set (every point will then
be equal to a constant: the training set mean). Also perhaps
confusing is that models with a larger k take longer to fit;
this should make sense as there are more points to average,
but goes against our general idea that more complex models
are more difficult to fit.
Clustering with K-means
An alternative to k-nearest neighbors is to split the dataset
into fixed neighborhoods of nearby points and use these as
indicator variables in a model. To do this, we need to cluster
the input data. There are many algorithms available for clustering;
one of the most widely known and implemented is k-means clustering.
This is achieved by way of the kmeans function and requires only
that we set the number of centers. Here is a plot of kmeans applied
to the acs data:
Notice that I have set the random seed as the output of the algorithm
is stochastic. Here, longitude and latitude are on similar scales, but
generally you should scale the data matrix X prior to clustering.
Let’s now build 20, 100, and 200 member clusters in the dataset:
We can use these in a glmnet model to determine which clusters
should have their own offsets. I am going to use the function
model.Matrix (note the captial M) from the MatrixModels
package for efficency purposes. We will discuss the details of
the function once we get to textual data:
And we will fit an elastic net model to the data:
Notice that the output resembles k-nearest neighbors:
There are important differences with the k-nearest neighbors
method, however. The regions to group together have been determined
adaptively. Notice that upstate New York and Maine have moslty be
pushed into the same values. The Metro NYC area, in constrast, has
many more regions.
We will now look at a new dataset of housing prices in Ames,
Iowa. Unlike our other housing datasets though, the samples
here are individual houses.
One way to build a predictive algorithm is to describe a decision
tree. The best way to understand what this is is to see an example.
Here, we use the tree library (pre-installed in R) to predict
This relatively old library needs a special plotting notation (so
don’t worry if this looks strange)
In order to do prediction, we start at the top of the tree and
look at each logical statement. If True, move to the left and
if False move to right. At the bottom (the terminal nodes or leaves),
there are predicted values. The tree was built by greedily picking
variables to spit the dataset up by until some stopping criterion
was reached; this is usually a fixed depth, a fixed proportion of
the data, or a fixed decrease in the RMSE of accuracy rate.
Decision trees give very noisy predictions due to their use of
greedy logic and because points on the boundary of a decision
cut-off are forced into a fixed bucket. By noisy, I mean that a
slightly different training set would yield significantly
different predictions for at least some of the test points.
This may seem like a design flaw, but we can easily turn it into
a design feature!
The idea of a random forest is to add some randomness into the
decision tree algorithm, fit a large number of trees using this
random variation, and produce predictions by averaging together
the predictions from all these individual trees (its a forest
because there are a lot of trees; get it?). The random logic
applies only the building of the trees; once created, each tree
is exactly the same as in the case above. The randomness comes
from two sources:
for each tree, select only a subset of the training data to
for each split, select only a subset of the available variables
to split on
The exact values for these two random features can be set as
hyperparameters. We can fit random forests using the randomForest
function from the package with the same name as follows:
Here I selected 20 randomly generated trees, each having at most
3 terminal nodes and only allowing one variable to be used at
each split. These are very low settings, used only for illustrating
the algorithm here. We can get predictions from each individual
tree by setting predict.all to TRUE:
Here is the prediction for just the third tree:
Can you figure out roughly what the tree looks like? It first
splits on overal quality being less than 7.5, and then splits
the lower quality houses by year built around 1982. The individual
prediction is not very smooth or complex.
Taking all of the twenty trees together, the average model
looks quite a bit different:
Helpfully, the randomForest also provides the function
importance that measures how important each variable is
to the model.
This is a measurement of how often the variable was used
in the model and how much it decreased the RMSE each time it
was used to split the dataset.
Gradient Boosted Trees
Gradient boosted trees offer a slightly different approach
to random forests for making use of the noisy nature of
decision trees. Like random forests, they construct a
number of trees, each using only a subset of the training
data. They do not restrict the variables available for
each node to split on. Most importantly, gradient boosted
trees are fit in a sequence, with each tree trying to predict
the residuals left over by the other trees.
More exactly, if the fitted values from the t-th tree
are given by:
Then we train the k-th tree on the values Z given by:
The parameter eta is the learning rate. If set to one, this
is exactly fitting on the residuals of the prior trees.
Setting to less than one stop the trees from overfitting from
the first few trees. Here, we prepare a larger set of variables
from the ames dataset:
We will use the xgboost package to fit gradient
boosted trees. I will set the eta parameter to 0.02.
And we can do prediction on the dataset:
Alternatively, we can use the function xgb.DMatrix to
combine the data matrix and labels:
And use a more advanced calling method for xgboost:
The algorithm is the same, but there are more options available
with xgb.train. As with random forests, there is a way of
looking at variable importance. I don’t like the default output
view, so here is some code to make it look nicer:
Thoughts on local models
We’ve covered a lot today. Here are some take aways:
k-nearest neighbors are great for smoothing predictions
or creating meta-features
k is inversely related to the complexity of the model
using clusters simulated KNN but allows for covariates
and adaptation to the data
don’t use the tree function for actual predictions; I
only used it to illustrate decision trees
random forests are easy to use and difficult to overfit
gradient boosted trees are incredibly powerful (often
the give the most predictive models in large ML competitions)
you need to tune eta and number of trees in GBT to get
a good model
The last point should be the object of study for the next
lab. We will discuss the details more next week when going
over the lab.