Today I am going to look at a dataset of taxi rides in NYC. For each
ride we want to predict how long the journey took:
Notice that we have only a limited amount of information about each ride:
time: day, hour, minute, and weekday
location: pickup and dropoff latitude / longitude / neighborhood
derived variables: trip distance
By way of review, we’ll go over all of the primary estimators we have
learned and show how they might be applied to this prediction problem.
One key thing here is to learn how we would approach a problem like
this. Models should usually not be applied in the order we learned them.
Model Selection and Baseline
My first step is to figure out which variables might be of interest and to
establish a quick baseline for how well I can generally expect to estimate
the output. Here we have only 11 variables, and only two are directly
categorical. This is a borderline case between using xgboost and
glmnet, so let’s use both.
To use either for model selection, I will usually put all of the
variables into the model. I leave out only very granular categorical
variables (such as the player name in the NBA shots data).
Now, we can apply the elastic net. Setting alpha equal to 0.9 is a good
default choice here:
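Below is a minimal sketch of what that fit might look like. The data frame
and column names (taxi_train, duration, trip_distance, and so on) are
assumptions about how the data are laid out, not the exact code used:

```r
library(glmnet)

# All of the variables go into the model matrix; the column names in the
# formula are assumed stand-ins for the variables described above
mf <- duration ~ trip_distance + hour + minute + day + weekday +
  pickup_longitude + pickup_latitude + pickup_neighborhood +
  dropoff_longitude + dropoff_latitude + dropoff_neighborhood
X_train <- model.matrix(mf, data = taxi_train)[, -1]
y_train <- taxi_train$duration

# Elastic net with alpha = 0.9; cross-validation selects lambda
model_enet <- cv.glmnet(X_train, y_train, alpha = 0.9)

# Coefficients at the 1se lambda plus two larger lambdas, restricted to
# the rows that are non-zero in the 1se model
lam <- model_enet$lambda.1se
beta <- coef(model_enet, s = c(4 * lam, 2 * lam, lam))
beta[as.numeric(beta[, 3]) != 0, , drop = FALSE]
```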
Above, I printed out only those rows where the 1se model has non-zero
coefficients. This limits the output to only the most important variables.
For interest, I also added several larger lambdas. Here is the
predictiveness of the model:
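As a rough sketch, the validation RMSE might be computed along these lines,
assuming a held-out data frame named taxi_valid with the same columns:

```r
# Validation predictions and RMSE at the 1se lambda
X_valid <- model.matrix(mf, data = taxi_valid)[, -1]
pred_enet <- as.numeric(predict(model_enet, newx = X_valid, s = lam))
sqrt(mean((taxi_valid$duration - pred_enet)^2))
```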
What do we learn from the elastic net? Here is a summary of what we learned:
Trip distance is the most important variable (unsurprising, perhaps)
The dropoff location affects the output significantly
The hour and day of the week also have an effect
We should be able to get an RMSE of no worse than around 370 seconds on the validation set
We can repeat this analysis of the best variables with boosted trees. These
may find different variable importance scores because they allow for learned
non-linearities and interactions.
I like to create the datasets and watchlist in a different code chunk to make
the code easier to iterate on in the future. I started the eta at 0.5, then
reduced it to 0.1, and then to 0.01. At 0.01 it seemed to slow down a lot but
still improve. I found that 2000 rounds seemed to be a good limit.
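A sketch of how that setup and fit might look, reusing the assumed model
matrices from above; parameters other than eta and the number of rounds are
illustrative defaults rather than tuned values:

```r
library(xgboost)

# DMatrix objects and the watchlist live in their own chunk so the training
# call below can be re-run cheaply while tuning
dtrain <- xgb.DMatrix(data = X_train, label = taxi_train$duration)
dvalid <- xgb.DMatrix(data = X_valid, label = taxi_valid$duration)
watchlist <- list(train = dtrain, valid = dvalid)

# Final settings after stepping eta down from 0.5 to 0.1 to 0.01
model_xgb <- xgb.train(
  params = list(objective = "reg:squarederror", eta = 0.01),
  data = dtrain,
  nrounds = 2000,
  watchlist = watchlist,
  print_every_n = 100
)
```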
This is a good sign that eta has been tuned correctly and we’ve used
a good number of rounds: the validation RMSE has stopped improving
but has not gotten worse, and the training RMSE has continued to decrease.
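The variable importance scores can then be pulled straight from the fitted
model; a one-line sketch:

```r
# Gain-based importance of each feature in the boosted trees
xgb.importance(model = model_xgb)
```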
So, once again trip distance is the most important. We see that hour,
dropoff location, and weekday are again important. By using non-linearities,
the location information seems even more important than we previously found.
What have we learned from the boosted trees?
Again, trip distance is the most important variable
Hour, dropoff location, and weekday are again the next most important
The boosted trees outperform the elastic net by a wide margin;
probably there are some non-linearities and interactions we need to handle
We should be able to get a test RMSE of no worse than around 280 seconds
When we find that the elastic net is better than, or at least nearly as
predictive as, the gradient boosted trees, there is a lot we can do in
terms of building new variables and interactions. Linear models
(remember, the elastic net is a linear model) have none of these
higher-order terms in them and need to be given them directly in the
model matrix.
When we find that the gradient boosted trees are significantly better,
trying to custom build new variables is rarely a good idea. At best, we
will approximate the boosted trees with a linear model. A better approach
is to figure out how to either (1) improve the boosted trees model or
(2) to construct a different model that can be blended with the boosted
trees. The first is not very interesting so let’s try the second.
Generalized Additive Models
Our best tools for building highly non-linear models are generalized
additive models. Here, I’ll interact the longitude and latitude
variables from both the pickup and dropoff locations, include the two
most important time variables, and add a non-linear trip distance term.
Then, I predict the values and compare against the xgboost and elastic
net results:
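Here is one way the fit might look with mgcv; the specific smooth terms and
basis sizes are assumptions (and hour and weekday are assumed to be stored
as numbers), not the definitive specification:

```r
library(mgcv)

# Tensor-product smooths for the pickup and dropoff coordinates, smooth
# terms for hour and weekday (k kept small since weekday has only 7 values),
# and a smooth term for trip distance
model_gam <- gam(
  duration ~ te(pickup_longitude, pickup_latitude) +
    te(dropoff_longitude, dropoff_latitude) +
    s(hour) + s(weekday, k = 5) + s(trip_distance),
  data = taxi_train
)

# Validation predictions and RMSE, for comparison with the models above
pred_gam <- as.numeric(predict(model_gam, newdata = taxi_valid))
sqrt(mean((taxi_valid$duration - pred_gam)^2))
```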
The predictive power of this model is not as good as the boosted
trees but also does not suffer from the same degree of overfitting.
We can usually do better by averaging together the models that we built
into a single meta-model. I tend to do this with a straightforward
linear or logistic regression fit on only the validation set:
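A minimal sketch of that blending step, assuming the three validation
prediction vectors from above are already in hand:

```r
# Fit the blend on the validation set only
df_blend <- data.frame(
  duration  = taxi_valid$duration,
  pred_xgb  = predict(model_xgb, newdata = dvalid),
  pred_gam  = pred_gam,
  pred_enet = pred_enet
)
model_blend <- lm(duration ~ pred_xgb + pred_gam + pred_enet, data = df_blend)
coef(model_blend)

# RMSE of the blended predictions (computed on the same validation set the
# blend was fit to, so read it with that caveat in mind)
sqrt(mean(residuals(model_blend)^2))
```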
About 85% of the final prediction comes from the boosted trees
and 13% from the gam model.
We see that the validation error has changed only marginally, but
has improved slightly. This new meta-model is also likely more
stable and less prone to model drift and overfitting.
How would the above analysis have changed if I were doing classification
rather than regression? For binary classification, I would not need
to change the process very much. Just a few basic argument changes,
sketched in code after this list:
set the family = "binomial" option in the glmnet function
set objective = "binary:logistic" in the xgb.train function
set the family = binomial() option in the gam function
blend models using glm with the binomial family
convert predictions to 0’s and 1’s using a cutoff
use accuracy in place of RMSE when evaluating the model
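To make these changes concrete, here is a hedged sketch of the modified
calls; the binary objects (y_train_binary, dtrain_binary, taxi_train_binary,
df_blend_binary) are hypothetical stand-ins for a binary version of the
problem:

```r
# Elastic net with a binomial family
model_enet_bin <- cv.glmnet(X_train, y_train_binary, alpha = 0.9,
                            family = "binomial")

# Boosted trees with a logistic objective
model_xgb_bin <- xgb.train(
  params = list(objective = "binary:logistic", eta = 0.01),
  data = dtrain_binary, nrounds = 2000, watchlist = watchlist_binary
)

# GAM with a binomial family
model_gam_bin <- gam(y ~ te(pickup_longitude, pickup_latitude) +
                       s(trip_distance),
                     data = taxi_train_binary, family = binomial())

# Blend with a logistic regression, classify with a 0.5 cutoff, and score
# with accuracy instead of RMSE
model_blend_bin <- glm(y ~ pred_xgb + pred_gam, data = df_blend_binary,
                       family = binomial())
pred_class <- as.numeric(predict(model_blend_bin, type = "response") > 0.5)
mean(pred_class == df_blend_binary$y)
```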
The nature of classification problems is of course often much different.
For example, it is much easier to overfit classification tasks and I
find that glmnet outperforms xgboost far more often in
binary classification tasks.
Multiclass Estimation (small number of cases)
If I have a problem with a small number of categories, say 3-6, I
would modify the general approach as follows:
set the family = "multinomial" option in the glmnet function
fit one-vs-many models using xgb.train
often forgo the gam model; if using, do one-vs-many models
blend models using multinom from the nnet package
convert to class predictions and use accuracy when evaluating
The one-vs-many approach is generally fine for all of these models and will
not take too long to run.
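A sketch of what the one-vs-many loop might look like with boosted trees,
assuming a hypothetical factor response named category; the eta and number
of rounds are illustrative only:

```r
# One binary model per class
classes <- levels(taxi_train$category)
fits <- lapply(classes, function(cl) {
  d <- xgb.DMatrix(data = X_train,
                   label = as.numeric(taxi_train$category == cl))
  xgb.train(params = list(objective = "binary:logistic", eta = 0.05),
            data = d, nrounds = 500)
})

# Score the validation set with each model and take the most probable class
probs <- sapply(fits, function(f) predict(f, newdata = xgb.DMatrix(X_valid)))
pred_class <- classes[max.col(probs)]
mean(pred_class == taxi_valid$category)   # accuracy
```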
Multiclass Estimation (larger number of cases)
When I have a larger number of cases, it quickly becomes infeasible
to do lots of one-vs-many models, at least at a first take. Of
course, for a major project it can be done but is not a good first
course of action.
If the number of categories does not exceed around a dozen and
I believe the dataset offers a reasonable chance at a linear solution,
my first step would be to fit a glmnet model with a multinomial
family. Hopefully this converges in a reasonable time-frame. I then
look at the confusion matrix to identify particular categories that
are hard to distinguish. If there are a few particularly hard clusters,
I attack each problem as a subproblem using binary classifiers or the
approach above for a small set of 3-6 classes.
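A sketch of that first step, again with a hypothetical factor response
named category; the confusion matrix is just a cross-tabulation of true and
predicted classes on the validation set:

```r
# Multinomial elastic net followed by a validation confusion matrix
model_multi <- cv.glmnet(X_train, taxi_train$category, alpha = 0.9,
                         family = "multinomial")
pred_multi <- predict(model_multi, newx = X_valid,
                      s = model_multi$lambda.1se, type = "class")

# Rows are true classes, columns are predictions; large off-diagonal counts
# flag pairs or clusters of categories worth treating as a subproblem
table(truth = taxi_valid$category, predicted = as.character(pred_multi))
```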
If the above approach does not work well or there are a very large
number of categories, I’ll use a dense neural network such as the
one from our last set of lecture notes. I have increasingly gone to
dense neural networks for multiclass problems on the boundary of
these cut-offs (say, 6-10 classes); this is partially due to my
growing comfort with how to train and test neural networks.
Other Directions in Structured Predictive Modelling
We now conclude the portion of the course dedicated to what I refer
to as structured data. That is, data where I essentially give you
the variables we need for modelling as columns in the raw data. Yes,
we sometimes created derived variables using intuition about the
problem or some type of basis expansion. However, for the most part
the way I gave you the data is a natural representation of the way
models are built for it. Over the remainder of the course we
will focus on unstructured data: text, images, and perhaps even
sound. Here the difficult task of featurization will be the most
important step of model building.
Before we conclude this part of the course, however, I want to stress
two very different messages. First of all, the modelling tools that
we have seen (specifically: the elastic net; gradient boosted trees;
KNNs; GAMs; SVMs; dense neural networks; GLMs) make up a sizeable portion
of what you need to do cutting-edge work in predictive modelling.
Yes, there are many fancy-sounding models that you’ll come across
searching the web, interviewing for jobs, or attending conferences.
And yes, some of these are very important in particular niche areas.
However, for structured data I personally don’t think anything else
outside of this set has proven itself as a crucial predictive tool that
everyone needs to know and use.
At the same time, there is of course a lot more to the field than
what we have already seen. Much of it concerns the process of data
cleaning and preparation, which is in fact what I spend the majority
of my time doing. There are also a lot of data quirks that
need addressing such as rare events, non-standard loss functions, and
missing values. Finally, if we want to implement these models in the
wild there are a lot of additional concerns such as how to monitor
the results, how to guard against model drift, and how to avoid