Class 12: Taxi!

2017-10-05

library(readr)
library(ggplot2)
library(dplyr)
library(methods)

Taxi Data

Today I am going to look at a dataset of taxi rides in NYC. For each ride we want to predict how long the journey took:

taxi <- read_csv("https://statsmaths.github.io/ml_data/nyc_taxi.csv")

Notice that we have only a limited amount of information about each ride:

  • time: day, hour, minute, and weekday
  • location: pickup and dropoff latitude / longitude / neighborhood
  • derived variables: trip distance

By way of review, we’ll go over all of the primary estimators we have learned and show how they might be applied to this prediction problem.

One key goal here is to learn how to approach a problem like this: models should usually not be applied in the order in which we learned them.

Model Selection and Baseline

My first step is to figure out which variables might be of interest and to establish a quick baseline for how well I can generally expect to estimate the output. Here we have only 11 variables, and only two are directly categorical. This is a borderline case between using xgboost and glmnet, so let’s use both.

To use either for model selection, I will usually put all of the variables into the model. I leave out only very granular categorical variables (such as the player name in the NBA shots data).

X <- model.matrix(~ . - 1, data = taxi[,4:14])
y <- taxi$duration

X_train <- X[taxi$train_id == "train",]
y_train <- y[taxi$train_id == "train"]
X_valid <- X[taxi$train_id == "valid",]
y_valid <- y[taxi$train_id == "valid"]

Elastic Net

Now, we can apply the elastic net. Setting alpha equal to 0.9 is a good starting spot.

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.2
model <- cv.glmnet(X_train, y_train, alpha = 0.9)
taxi$duration_pred_enet <- predict(model, newx = X)
beta <- coef(model, s = c(model$lambda.min, model$lambda.1se, model$lambda[c(3, 10)]))
beta[beta[,2] != 0,]
## 15 x 4 sparse Matrix of class "dgCMatrix"
##                                 1             2        3         4
## (Intercept)         -2.398902e+05 -85486.551029 744.9057 585.25342
## hour                 7.467159e+00      5.221391   .        .      
## weekday              1.500917e+01      6.565253   .        .      
## dropoff_longitude   -3.590369e+03  -1209.346410   .        .      
## dropoff_latitude     3.586076e+02    -90.571609   .        .      
## trip_distance        2.214617e+02    168.700415  26.1136  89.97271
## pickup_NTACodeMN17   4.846742e+01     10.254655   .        .      
## dropoff_NTACodeBK37  1.009896e+02     32.288590   .        .      
## dropoff_NTACodeBK85  1.453134e+03    794.612358   .        .      
## dropoff_NTACodeBX03 -1.834410e+03    -15.662501   .        .      
## dropoff_NTACodeMN17  1.178671e+02     54.777096   .        .      
## dropoff_NTACodeMN27  1.003461e+02      6.154257   .        .      
## dropoff_NTACodeMN36 -4.067262e+02    -24.203447   .        .      
## dropoff_NTACodeQN70  3.102096e+02     86.367527   .        .      
## dropoff_NTACodeQN98 -3.794986e+02    -53.069475   .        .

Above, I printed out only those rows where the 1se model has non-zero coefficients. This limits the output to only the most important variables. For interest, I also added several larger lambdas. Here is the predictiveness of the model:

sqrt(tapply((taxi$duration_pred_enet - taxi$duration)^2, taxi$train_id, mean))
##     test    train    valid 
##       NA 391.6076 368.3526

What do we learn from the elastic net? Here is a summary:

  • Trip distance is the most important variable (unsurprising, perhaps)
  • The dropoff location affects the output significantly
  • The hour and day of the week also have an effect
  • We should be able to get an RMSE of roughly 370 seconds or better on the validation set

Boosted Trees

We can repeat this analysis of the best variables with boosted trees. These may find different variable importance scores because they allow for learned non-linearities and interactions.

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
data_train <- xgb.DMatrix(data = X_train, label = y_train)
data_valid <- xgb.DMatrix(data = X_valid, label = y_valid)
watchlist <- list(train=data_train, valid=data_valid)

I like to create the datasets and watchlist in a different code chunk to make the code easier to iterate on in the future. I started the eta at 0.5, then reduced it to 0.1, and then to 0.01. At 0.01 it seemed to slow down a lot but still improve. I found that 2000 rounds seemed to be a good limit.
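For reference, here is a minimal sketch of how one could compare candidate values of eta over a shorter run before settling on one; this illustrates the process described above rather than the exact code I ran:

# sketch: try a few learning rates over a short run and compare the
# printed valid-rmse values (illustrative only)
for (eta_try in c(0.5, 0.1, 0.01)) {
  xgb.train(data = data_train,
            max_depth = 4, eta = eta_try, nthread = 2,
            nrounds = 200, objective = "reg:linear",
            watchlist = watchlist,
            print_every_n = 50)
}

With eta fixed at 0.01, here is the final model: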

model <- xgb.train(data = data_train,
                 max_depth = 4, eta = 0.01, nthread = 2,
                 nrounds = 2000, objective = "reg:linear",
                 watchlist = watchlist,
                 print_every_n = 100)
## [1]	train-rmse:997.909607	valid-rmse:1000.418945 
## [101]	train-rmse:485.978882	valid-rmse:466.861938 
## [201]	train-rmse:347.463562	valid-rmse:326.457520 
## [301]	train-rmse:310.312622	valid-rmse:296.435303 
## [401]	train-rmse:294.272980	valid-rmse:287.424194 
## [501]	train-rmse:284.407532	valid-rmse:284.137787 
## [601]	train-rmse:277.759247	valid-rmse:281.635223 
## [701]	train-rmse:271.847198	valid-rmse:280.369202 
## [801]	train-rmse:267.359344	valid-rmse:279.314697 
## [901]	train-rmse:263.321503	valid-rmse:278.976807 
## [1001]	train-rmse:258.913574	valid-rmse:278.692444 
## [1101]	train-rmse:255.782822	valid-rmse:278.304840 
## [1201]	train-rmse:253.180908	valid-rmse:278.120331 
## [1301]	train-rmse:250.683762	valid-rmse:278.215790 
## [1401]	train-rmse:247.992340	valid-rmse:278.323151 
## [1501]	train-rmse:245.324371	valid-rmse:278.337646 
## [1601]	train-rmse:242.883987	valid-rmse:278.496765 
## [1701]	train-rmse:240.666687	valid-rmse:278.681610 
## [1801]	train-rmse:238.393677	valid-rmse:278.941254 
## [1901]	train-rmse:236.186127	valid-rmse:279.098358 
## [2000]	train-rmse:233.944153	valid-rmse:279.142578

This is a good sign that eta has been tuned correctly and we’ve used a good number of rounds: the validation RMSE has levelled off without getting appreciably worse, while the training RMSE has continued to improve.
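If we did not want to pick the number of rounds by hand, xgboost can also stop on its own once the validation metric stalls. A minimal sketch using the early_stopping_rounds argument of xgb.train (not something I ran above):

# sketch: stop automatically once valid-rmse fails to improve for 50 rounds
model_es <- xgb.train(data = data_train,
                      max_depth = 4, eta = 0.01, nthread = 2,
                      nrounds = 5000, objective = "reg:linear",
                      watchlist = watchlist,
                      early_stopping_rounds = 50,
                      print_every_n = 100)
model_es$best_iteration  # round with the lowest validation RMSE

Returning to the model fit above, we can look at the variable importance scores: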

importance_matrix <- xgb.importance(model = model)
importance_matrix[,1] <- colnames(X)[as.numeric(importance_matrix[[1]]) + 1]
importance_matrix
##                 Feature         Gain        Cover    Frequency
##  1:       trip_distance 7.347274e-01 1.976832e-01 0.1342155782
##  2:                hour 1.107094e-01 1.059788e-01 0.1328268595
##  3:    dropoff_latitude 3.556750e-02 1.171836e-01 0.1179594004
##  4:             weekday 2.326470e-02 6.521071e-02 0.0675570804
##  5:   dropoff_longitude 2.066604e-02 8.774949e-02 0.1007229506
##  6:    pickup_longitude 1.540740e-02 6.039457e-02 0.0772781113
##  7:                 day 1.245890e-02 4.053316e-02 0.0890005310
##  8:     pickup_latitude 9.635780e-03 6.222644e-02 0.0682514398
##  9:              minute 7.579435e-03 1.723226e-02 0.0513009027
## 10: dropoff_NTACodeMN17 5.073970e-03 7.370486e-03 0.0105787689
## 11: dropoff_NTACodeBK85 2.393900e-03 2.447565e-02 0.0098435649
## 12:  pickup_NTACodeMN17 2.067905e-03 1.789730e-02 0.0084956909
## 13:  pickup_NTACodeMN20 2.004553e-03 5.578095e-03 0.0042886901
## 14: dropoff_NTACodeMN25 1.512587e-03 2.885549e-03 0.0026140587
## 15: dropoff_NTACodeMN22 1.480923e-03 7.908699e-04 0.0043703794
## 16: dropoff_NTACodeMN27 1.440656e-03 1.098573e-02 0.0060041662
## 17:  pickup_NTACodeMN14 1.387781e-03 6.697318e-03 0.0058407875
## 18: dropoff_NTACodeMN15 1.105036e-03 2.816811e-03 0.0059224768
## 19: dropoff_NTACodeMN40 9.345036e-04 4.910282e-03 0.0041661561
## 20: dropoff_NTACodeBK37 8.715255e-04 1.827260e-02 0.0062083895
## 21:  pickup_NTACodeMN21 5.962083e-04 6.197777e-03 0.0051872728
## 22:  pickup_NTACodeMN31 5.486951e-04 1.462892e-04 0.0031858841
## 23: dropoff_NTACodeQN31 5.273821e-04 8.418971e-04 0.0026957481
## 24:  pickup_NTACodeMN24 4.823224e-04 1.653749e-04 0.0021239227
## 25: dropoff_NTACodeQN70 4.605422e-04 8.124247e-03 0.0031450394
## 26: dropoff_NTACodeQN63 4.266210e-04 1.136184e-03 0.0033084181
## 27:  pickup_NTACodeMN12 4.085791e-04 3.112682e-04 0.0007760487
## 28: dropoff_NTACodeBK61 4.069975e-04 1.158437e-03 0.0021239227
## 29:  pickup_NTACodeMN32 3.855854e-04 1.134726e-04 0.0017971654
## 30:  pickup_NTACodeMN40 3.772183e-04 7.266744e-04 0.0028999714
## 31:  pickup_NTACodeMN19 3.641141e-04 1.818144e-04 0.0020422334
## 32: dropoff_NTACodeBX55 3.514838e-04 1.175070e-02 0.0049422048
## 33: dropoff_NTACodeQN98 3.228439e-04 1.097575e-03 0.0024098354
## 34: dropoff_NTACodeBK40 3.121957e-04 1.287640e-02 0.0042070008
## 35: dropoff_NTACodeMN12 3.094016e-04 2.300913e-04 0.0007760487
## 36: dropoff_NTACodeMN50 2.564946e-04 1.275135e-02 0.0041661561
## 37:  pickup_NTACodeMN22 2.394492e-04 3.295017e-03 0.0024098354
## 38: dropoff_NTACodeMN21 2.244741e-04 8.657585e-03 0.0028999714
## 39:  pickup_NTACodeMN04 2.133755e-04 3.325417e-05 0.0011844954
## 40:  pickup_NTACodeMN15 1.888594e-04 6.277182e-03 0.0034717968
## 41: dropoff_NTACodeBK33 1.705349e-04 3.589179e-03 0.0023689907
## 42: dropoff_NTACodeMN19 1.453461e-04 4.494313e-05 0.0009802720
## 43: dropoff_NTACodeMN20 1.436128e-04 1.909468e-03 0.0013887187
## 44:  pickup_NTACodeMN23 1.414159e-04 5.043757e-04 0.0008985827
## 45: dropoff_NTACodeQN20 1.313315e-04 4.044423e-03 0.0019196994
## 46: dropoff_NTACodeBK90 1.262550e-04 9.876133e-03 0.0032267288
## 47:  pickup_NTACodeMN27 1.221060e-04 9.313459e-04 0.0010211167
## 48:  pickup_NTACodeMN25 1.188604e-04 7.562197e-04 0.0010211167
## 49: dropoff_NTACodeMN31 1.102947e-04 5.303373e-03 0.0017563207
## 50: dropoff_NTACodeBK63 1.063520e-04 7.875445e-03 0.0025732141
## 51: dropoff_NTACodeMN11 1.018370e-04 4.350545e-05 0.0005309807
## 52: dropoff_NTACodeMN32 1.005551e-04 1.801892e-04 0.0004901360
## 53: dropoff_NTACodeBK78 8.699454e-05 5.072969e-03 0.0020013887
## 54: dropoff_NTACodeMN35 7.160700e-05 5.534652e-04 0.0005309807
## 55: dropoff_NTACodeMN14 7.130201e-05 6.769598e-05 0.0004492913
## 56: dropoff_NTACodeQN38 7.029865e-05 6.750742e-03 0.0022056121
## 57:  pickup_NTACodeMN99 6.583294e-05 4.874652e-03 0.0015929420
## 58: dropoff_NTACodeMN28 4.203267e-05 1.529567e-04 0.0002450680
## 59:  pickup_NTACodeMN33 3.907128e-05 4.625330e-03 0.0015112527
## 60: dropoff_NTACodeQN25 3.685886e-05 7.167565e-06 0.0001633787
## 61: dropoff_NTACodeMN23 3.517930e-05 1.898154e-05 0.0002450680
## 62: dropoff_NTACodeQN28 3.514668e-05 3.875402e-03 0.0012661847
## 63: dropoff_NTACodeMN24 3.178200e-05 7.396760e-06 0.0017563207
## 64: dropoff_NTACodeBK68 2.867309e-05 3.125392e-03 0.0010211167
## 65: dropoff_NTACodeMN13 2.592691e-05 1.958579e-05 0.0011028060
## 66:  pickup_NTACodeMN13 2.441442e-05 7.792643e-06 0.0006943594
## 67: dropoff_NTACodeBK64 2.196310e-05 2.769097e-05 0.0003267573
## 68: dropoff_NTACodeMN09 2.112555e-05 1.860650e-05 0.0012661847
## 69: dropoff_NTACodeBK73 1.430148e-05 3.167063e-06 0.0003267573
## 70:  pickup_NTACodeMN28 1.359320e-05 1.734592e-04 0.0002042233
## 71: dropoff_NTACodeMN06 1.238708e-05 1.141393e-04 0.0002859127
## 72: dropoff_NTACodeQN29 1.198799e-05 2.771181e-06 0.0002859127
## 73: dropoff_NTACodeBK77 1.108015e-05 1.375172e-03 0.0004492913
## 74: dropoff_NTACodeMN99 9.175352e-06 1.120099e-03 0.0003676020
## 75: dropoff_NTACodeQN17 8.019882e-06 1.333500e-06 0.0003267573
##                 Feature         Gain        Cover    Frequency

So, once again, trip distance is the most important variable. We see that hour, dropoff location, and weekday are again important. Once non-linearities are allowed, the location information appears even more important than the elastic net suggested.

taxi$duration_pred_xgb <- predict(model, newdata = X)
sqrt(tapply((taxi$duration_pred_xgb - taxi$duration)^2, taxi$train_id, mean))
##     test    train    valid 
##       NA 233.9442 279.1426

What have we learned from the boosted trees?

  • Again, trip distance is the most important variable
  • Hour, dropoff location, and weekday are again the next most important
  • The boosted trees outperform the elastic net by a wide margin; there are probably some non-linearities and interactions we need to handle
  • We should be able to get a test RMSE of roughly 280 seconds

What now?

When we find that the elastic net is better than, or at least nearly as predictive as, the gradient boosted trees, there is a lot we can do in terms of building new variables and interactions. Linear models (remember, the elastic net is a linear model) contain none of these higher-order terms on their own and must be given them directly in the model matrix.
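That is not our situation here, but for illustration, such terms could be added directly when building the model matrix and the elastic net refit on the expanded matrix. The specific terms below are untested guesses:

# sketch: hand-built non-linear and interaction terms for the linear model
X2 <- model.matrix(~ poly(trip_distance, 3) +
                     factor(hour) * factor(weekday) +
                     dropoff_NTACode - 1,
                   data = taxi)
model_enet2 <- cv.glmnet(X2[taxi$train_id == "train", ], y_train, alpha = 0.9)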

When we find that the gradient boosted trees are significantly better, trying to custom build new variables is rarely a good idea. At best, we will approximate the boosted trees with a linear model. A better approach is to figure out how to either (1) improve the boosted trees model or (2) construct a different model that can be blended with the boosted trees. The first is not very interesting, so let’s try the second.

Generalized additive models

Our best tool for building highly non-linear models is the generalized additive model. Here, I’ll interact the longitude and latitude variables from both the pickup and dropoff locations, interact the two most important time variables, and add a non-linear trip distance term.

library(gam)
## Loading required package: splines
## Loaded gam 1.14-4
model <- gam(duration ~ lo(pickup_longitude, pickup_latitude) +
                        lo(dropoff_longitude, dropoff_latitude) +
                        lo(hour, weekday) +
                        s(trip_distance),
             data = taxi,
             subset = (train_id == "train"))

Then, I predict the values and compare the xgboost and elastic net models:

taxi$duration_pred_gam <- predict(model, newdata = taxi)
sqrt(tapply((taxi$duration_pred_gam - taxi$duration)^2, taxi$train_id, mean))
##     test    train    valid 
##       NA 332.7668 308.2280

The predictive power of this model is not as good as that of the boosted trees, but it also does not suffer from the same degree of overfitting.

Blending

We can usually do better by averaging together the models that we built into a single meta-model. I tend to do this with a straightforward linear or logistic regression fit on only the validation set:

model <- lm(duration ~ duration_pred_gam + duration_pred_xgb - 1,
            data = taxi,
            subset = (train_id == "valid"))
summary(model)
## 
## Call:
## lm(formula = duration ~ duration_pred_gam + duration_pred_xgb - 
##     1, data = taxi, subset = (train_id == "valid"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1299.72  -144.45   -24.58   108.47  1764.86 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## duration_pred_gam  0.13425    0.04075   3.295    0.001 ** 
## duration_pred_xgb  0.85604    0.04010  21.348   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 278.2 on 1998 degrees of freedom
## Multiple R-squared:  0.9242,	Adjusted R-squared:  0.9241 
## F-statistic: 1.218e+04 on 2 and 1998 DF,  p-value: < 2.2e-16
taxi$duration_pred <- predict(model, newdata = taxi)

About 85% of the final prediction comes from the boosted trees and 13% from the gam model.

sqrt(tapply((taxi$duration_pred - taxi$duration)^2, taxi$train_id, mean))
##     test    train    valid 
##       NA 243.6894 278.0722

We see that the validation error has improved, though only slightly. This new meta-model is also likely more stable and less prone to model drift and overfitting.

Classification

Binary Classes

How would the above analysis have changed if I were doing classification rather than regression? For binary classification, I would not need to change the process very much. Just a few basic argument changes (a compressed sketch follows the list):

  • set the family = "binomial" option in the glmnet function
  • set objective = "binary:logistic" in the xgb.train function
  • set the family = binomial() option in the gam function
  • blend models using glm with the binomial family
  • convert predictions to 0’s and 1’s using a cutoff
  • use accuracy in place of RMSE when evaluating the model
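
Here is a compressed sketch of those changes, using the feature matrices from above together with a hypothetical 0/1 outcome y_bin (not a column of the taxi data):

# sketch: the regression pipeline adapted to a hypothetical binary outcome y_bin
y_bin_train <- y_bin[taxi$train_id == "train"]

model_enet <- cv.glmnet(X_train, y_bin_train,
                        family = "binomial", alpha = 0.9)

data_bin <- xgb.DMatrix(data = X_train, label = y_bin_train)
model_xgb <- xgb.train(data = data_bin, max_depth = 4, eta = 0.01,
                       nthread = 2, nrounds = 500,
                       objective = "binary:logistic")

# convert predicted probabilities into classes with a 0.5 cutoff,
# then use accuracy in place of RMSE
pred_class <- as.numeric(predict(model_xgb, newdata = X) > 0.5)
tapply(pred_class == y_bin, taxi$train_id, mean)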

The nature of classification problems is, of course, often quite different. For example, it is much easier to overfit classification tasks, and I find that glmnet outperforms xgboost far more often in binary classification tasks.

Multiclass Estimation (small number of cases)

If I have a problem with a small number of categories, say 3-6, I would modify the general approach as follows:

  • set the family = "multinomial" option in the glmnet function
  • fit one-vs-many models using xgb.train
  • often forgo the gam model; if using, do one-vs-many models
  • blend models using multinom from the nnet package
  • convert to class predictions and use accuracy when evaluating the model

The one-vs-many approach is generally fine for all of these models and will not take too long to run.
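
For instance, the one-vs-many boosted trees might look like the following sketch, where y_class is a hypothetical categorical outcome (again, not a column of the taxi data):

# sketch: one-vs-many boosted trees for a hypothetical outcome y_class
classes <- sort(unique(y_class))
prob <- matrix(NA_real_, nrow = nrow(X), ncol = length(classes))

for (j in seq_along(classes)) {
  y_j <- as.numeric(y_class == classes[j])
  data_j <- xgb.DMatrix(data = X_train,
                        label = y_j[taxi$train_id == "train"])
  model_j <- xgb.train(data = data_j, max_depth = 4, eta = 0.05,
                       nthread = 2, nrounds = 500,
                       objective = "binary:logistic")
  prob[, j] <- predict(model_j, newdata = X)
}

# the predicted class is the column with the largest probability
pred_class <- classes[apply(prob, 1, which.max)]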

Multiclass Estimation (larger number of cases)

When I have a larger number of categories, it quickly becomes infeasible to fit lots of one-vs-many models, at least as a first pass. Of course, for a major project it can be done, but it is not a good first course of action.

If the number of categories does not exceed around a dozen and I believe the dataset offers a reasonable chance at a linear solution, my first step would be to fit a glmnet model with a multinomial family. Hopefully this converges in a reasonable time-frame. I then look at the confusion matrix to identify particular categories that are hard to distinguish. If there are a few particularly hard clusters, I attack each problem as a subproblem using binary classifiers or the approach above for a small set of 3-6 classes.
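A sketch of that first step, again with a hypothetical categorical outcome y_class:

# sketch: multinomial elastic net plus a validation confusion matrix
model_multi <- cv.glmnet(X_train, y_class[taxi$train_id == "train"],
                         family = "multinomial", alpha = 0.9)
pred <- predict(model_multi, newx = X, type = "class")

# rows are the truth, columns the predictions; large off-diagonal
# counts point to categories that are hard to distinguish
table(truth = y_class[taxi$train_id == "valid"],
      pred  = pred[taxi$train_id == "valid"])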

If the above approach does not work well or there are a very large number of categories, I’ll use a dense neural network such as the one from our last set of lecture notes. I have increasingly gone to dense neural networks for multiclass problems on the boundary of these cut-offs (say, 6-10 classes); this is partially due to my growing comfort with how to train and test neural networks.

Other Directions in Structured Predictive Modelling

We now conclude the portion of the course dedicated to what I refer to as structured data. That is, data where I essentially give you the variables needed for modelling as columns in the raw data. Yes, we sometimes created derived variables using intuition about the problem or some type of basis expansion. However, for the most part the way I gave you the data is a natural representation of the way models are built for it. Over the remainder of the course we will focus on unstructured data: text, images, and perhaps even sound. Here the difficult task of featurization will be the most important step of model building.

Before we conclude this part of the course, however, I want to stress two very different messages. First of all, the modelling tools that we have seen (specifically: the elastic net; gradient boosted trees; KNNs; GAMs; SVMs; dense neural networks; GLMs) make up a sizeable portion of what you need to do cutting-edge work in predictive modelling. Yes, there are many fancy-sounding models that you’ll come across searching the web, interviewing for jobs, or attending conferences. And yes, some of these are very important in particular niche areas. However, for structured data I personally don’t think anything outside of this set has proven itself as a crucial predictive tool that everyone needs to know and use.

At the same time, there is of course a lot more to the field than what we have already seen. Much of it concerns the process of data cleaning and preparation, which is in fact where I spend the majority of my time. There are also a lot of data quirks that need addressing, such as rare events, non-standard loss functions, and missing values. Finally, if we want to implement these models in the wild there are a lot of additional concerns, such as how to monitor the results, how to guard against model drift, and how to avoid adversarial attacks.