# Class 12: Taxi!

### 2017-10-05

## Taxi Data

Today I am going to look at a dataset of taxi rides in NYC. For each ride we want to predict how long the journey took:

Notice that we have only a limited amount of information about each ride:

- time: day, hour, minute, and weekday
- location: pickup and dropoff latitude / longitude / neighborhood
- derived variables: trip distance

As way of review, we’ll go over all of the primary estimators we have learned and show how they might be applied to this prediction problem.

One key thing here is to learn how we would approach a problem like this. Models should usually not be applied in the order we learned them.

## Model Selection and Baseline

My first step is to figure out which variables might be of interest and to
establish a quick baseline for how well I can generally expect to estimate
the output. Here we have only 11 variables, and only two are directly
categorical. This is a borderline case between using **xgboost** and
**glmnet**, so let’s use both

To use the either for model selection, I will usually put all of the variables into the model. I leave out only very grainular categorical variables (such as the player name in the NBA shots data).

### Elastic Net

Now, we can apply the elastic net. Setting alpha equal to 0.9 is a good starting spot.

Above, I printed out only those rows where the 1se model has non-zero coefficents. This limits the output only to those most important variables. For interest, I also added several larger lambdas. Here is the predictiveness of the model:

What do we learn from the elastic net? Here are a summary of things learned:

- Trip distance is the most important variable (unsurprising, perhaps)
- The dropoff location effects the output significantly
- The hour and day of the week also has an effect
- We should be able to get an RMSE of at least around 370 seconds on the validation set

### Boosted Trees

We can repeat this analysis of the best variables with boosted trees. These may find different variable importance scores because it allows for learned non-linearities and interactions.

I like to create the datasets and watchlist in a different code chunk to make the code easier to iterate on in the future. I started the eta at 0.5, then reduced it to 0.1, and then to 0.01. At 0.01 it seemed to slow down a lot but still improve. I found that 2000 rounds seemed to be a good limit.

This is a good sign that eta has been tuned correctly and we’ve used a good number of rounds: the validation RMSE has stopped improving but has not gotten worse and the training RMSE has continued to improve.

So, once again trip distance is the most important. We see that hour, dropoff location, and weekday are again important. By using non-linearities, the location information seems even more important than we previously found.

What have we learned from the boosted trees?

- Again, trip distance is the most important variable
- Hour, dropoff location, and hour are again the second most important
- The boosted trees outperfrom the elastic net by a wide marging; probable some non-linearities and interactions we need to handle
- Should be able to get a test RMSE of at least around 280 seconds

## What now?

When we find that the elastic net is better or at least nearly as predictive as the gradient trees there is a lot we can do in terms of building new variables and interactions. Linear models (remember, the elastic net is a linear model) have none of these higher-order terms in them and need to be given them directly in the model matrix.

When we find that the gradient boosted trees are significantly better, trying to custom build new variables is rarely a good idea. At best, we will approximate the boosted trees with a linear model. A better approach is to figure out how to either (1) improve the boosted trees model or (2) to construct a different model that can be blended with the boosted trees. The first is not very interesting so let’s try the second.

### Generalize additive models

Our best tool for building highly non-linear models are generalized additive models. Here, I’ll interact the longitude and latitude variables from both the pickup and dropoff locations, the two most important time variables, and add a non-linear trip distance term.

Then, I predict the values and compare the xgboost and elastic net models:

The predictive power of this model is not as good as the boosted trees but also does not suffer from the same degree of overfitting.

### Blending

We can usually do better by averaging together the models that we built into a single meta-model. I tend to do this with a straightforward linear or logistic regression fit on only the validation set:

About 85% of the final prediction comes from the boosted trees and 13% from the gam model.

We see that the validation error has changed only marginally, but has improved slightly. This new meta-model is also likely more stable and less prone to model drift and overfitting.

## Classification

### Binary Classes

How would the above analysis have changed if I was doing classification rather than regression? For binary classification, I would not need to do very much to change the process. Just a few basic argument changes:

- set the
`family = "binomial"`

option in the`glmnet`

function - set
`objective = "binary:logistic"`

in the`xgb.train`

function - set the
`family = binomial()`

option in the`gam`

function - blend models using
`glm`

with the binomial family - convert predictions to 0’s and 1’s using a cutoff
- use accuracy in place of RMSE when evaluating the model

The nature of classification problems is of course often much different.
For example, it is much easier to overfit classification tasks and I
find that `glmnet`

outperforms `xgboost`

far more often in
binary classification tasks.

### Multiclass Estimation (small number of cases)

If I have a problem with a small number of categories, say 3-6, I would modify the general approach as follows:

- set the
`family = "multinomial"`

option in the`glmnet`

function - fit one-vs-many models using
`xgb.train`

- often forgo the
`gam`

model; if using, do one-vs-many models - blend models using
`multinom`

from the**nnet**package - convert to class predictions and use accuracy when evaluating the model

The one-vs-many is generally fine for all of these models and will not take too long to use.

### Multiclass Estimation (larger number of cases)

When I have a larger number cases, it quickly becomes infeasible to do lots of one-vs-many models, at least at a first take. Of course, for a major project it can be done but is not a good first course of action.

If the number of categories does not exceed around a dozen and
I believe the dataset offers a reasonable chance at a linear solution,
my first step would be to fit a `glmnet`

model with a multinomial
family. Hopefully this converges in a reasonable time-frame. I then
look at the confusion matrix to identify particular categories that
are hard to distinguish. If there are a few particularly hard clusters,
I attack each problem as a subproblem using binary classifiers or the
approach above for a small set of 3-6 classes.

If the above approach does not work well or there are a very large number of categories, I’ll use a dense neural network such as the one from our last set of lecture notes. I have increasingly gone to dense neural networks for multiclass problems on the boundary of these cut-offs (say, 6-10 classes); this is partially due to my growing comfort with how to train and test neural networks.

## Other Directions in Structured Predictive Modelling

We now conclude the portion of the course dedicated to what I refer
to as *structured* data. That is, data where I essentially give you
the variables we need for modelling as columns in the raw data. Yes,
we sometimes created derived variables using intuition about the
problem or some type of basis expansion. However, for the most part
the way I gave you the data is a natural representation of the the
way models are built for it. Over the remainder of the course we
will focus on unstructured data: *text*, *images*, and perhaps even
*sound*. Here the difficult task of featurization will be the most
important step of model building.

Before we conclude this part of the course, however, I want to stress
two very different messages. First of all, the modelling tools that
we have seen (specifically: the elastic net; gradient boosted trees;
KNNs; GAMs; SVMs; dense neural networks; GLMs) make up a sizeable portion
of what you need to do cutting-edge work in predictive modelling.
Yes, there are **many** fancy sounding models that you’ll come across
searching across the web, interviewing or jobs, or attending conferences.
And yes, some of these are very important in particular niche areas.
However, for structured data I personally don’t think anything else
outside of this set has proven itself as a crutial preditive tool that
everyone needs to know and use.

At the same time, there is of course a lot more to the field than what we have already seen. Many of these concern the process of data cleaning and preparation. This is actually what I spend the majority of my time doing, in fact. There are also a lot of data quirks that need addressing such as rare events, non-standard loss functions, and missing values. Finally, if we want to implement these models in the wild there are a lot of additional concerns such as how to monitor the results, how to guard against model drift, and how to avoid adversarial attacks.