library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(glmnet)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

Binomial and Multinomial elastic net

The “g” in glmnet stands for the same kind of “generalized” as in glm: we can fit classification models by adjusting the family argument. Let’s use the NBA shot data set as an example. To start, read in the data set and create a model matrix with all of the variables other than the player name.

set.seed(1)
nba <- read_csv("data/nba_shots.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))

mf <- model.frame(fgm ~ . -1, data = select(nba, -player_name, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[nba$train_id == "train",]
y_train <- y[nba$train_id == "train"]

We could fit a binomial model with glmnet, but it is easier to generalize to other data sets if we use a multinomial regression model instead, since it also works when the variable we are trying to predict can take on more than two categories.

Multinomial regression, as implemented by glmnet, works by fitting a separate logistic regression model for each category, so we have a set of coefficients for each category. Prediction works by applying each of the models and seeing which category has the greatest probability.

Here is the syntax to run this model using the elastic net:

model <- cv.glmnet(X_train, y_train, alpha = 0.9, family = "multinomial")

Now, let’s take a look at the non-zero coefficients in the model for a specific value of lambda. The code here is something that we will use many times, changing only the parameter s in the first line.

temp <- coef(model, s = model$lambda[22])   # one coefficient matrix per category
beta <- Reduce(cbind, temp)                 # bind them into a single matrix
beta <- beta[apply(beta != 0, 1, any),]     # keep rows with any non-zero coefficient
colnames(beta) <- names(temp)               # label the columns with the category names
beta
## 6 x 2 sparse Matrix of class "dgCMatrix"
##                            0            1
## (Intercept)     -0.258253574  0.258253574
## shot_clock      -0.007939177  0.007939177
## touch_time       0.006895195 -0.006895195
## shot_dist        0.021474776 -0.021474776
## close_def_dist  -0.021229592  0.021229592
## defender_height  0.001642631 -0.001642631

The first column gives the coefficients used to classify the 0 category (shot missed) and the second those for the 1 category (shot made). In this case they have the same magnitude and opposite signs. The columns will contain different information when there are three or more categories, as you will see in the lab for today.

And then, to get the classification rate, we use the predict function and set the type argument equal to “class”:

nba %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == fgm))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.585
## 2 valid         0.589
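
As described above, the class prediction is just the category with the largest predicted probability. Here is a quick sketch checking that by hand on a few rows; the choice of the first five rows is arbitrary, and by default predictions from a cv.glmnet object use the lambda.1se value of lambda:

p <- drop(predict(model, newx = X[1:5, ], type = "response"))  # per-category probabilities
colnames(p)[apply(p, 1, which.max)]                            # category with the largest probability
predict(model, newx = X[1:5, ], type = "class")[, 1]           # agrees with the "class" predictions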

Scaling and intercepts (some technical details)

As we wrote the lasso, ridge regression, and elastic net, the scale of the predictor variables would have a large impact on the model. We did not focus on this because glmnet scales the columns to have unit variance and zero mean before fitting. Generally, we do not have to worry about this, and I have rarely found a reason to modify this default behavior.
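
If you ever do need to turn off the internal scaling, for example when the columns are already on a common scale, glmnet exposes this through its standardize argument. This is just a sketch of the call; it is not something we will need in this course:

model_unscaled <- cv.glmnet(X_train, y_train, alpha = 0.9,
                            family = "multinomial", standardize = FALSE)  # penalize on the original scale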

The elastic net function also puts in an intercept for us. The intercept is treated differently because it does not have a penalty attached to it. Again, this is almost always the preferred behavior and you will likely have no reason to change it. If you accidentally put an intercept column into the model matrix, it will be silently ignored (why would the model put weight on your intercept, which is penalized, rather than on the internal one, which is not?).
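
If you want to convince yourself of this, one way is to refit with a constant column added to the model matrix and confirm that its coefficient stays at zero; the column name manual_intercept below is made up for this illustration:

X_train_check <- cbind(manual_intercept = 1, X_train)                       # add a hypothetical manual intercept
model_check <- cv.glmnet(X_train_check, y_train, alpha = 0.9, family = "multinomial")
coef(model_check, s = model_check$lambda.min)[["1"]]["manual_intercept", ]  # should come back as 0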

Cross-validation

How does the elastic net determine which value of lambda is the best? It uses a process called cross-validation (that’s where the “cv” comes from) where the training set itself is used to do automated validation of the model.

Cross-validation works as follows. Here I am using 10-fold validation, but you can modify this to use any number of folds (k-fold validation); a rough sketch in code follows the list:

  • start with an initial choice of lambda
  • assign every training observation randomly to one of ten buckets
  • fit a model using only data from buckets 2-10 with the first lambda value. Use this to predict the values from bucket 1.
  • fit a model using only data from buckets 1 and 3-10. Use this to predict the values from bucket 2.
  • repeat eight more times to get predictions for the other buckets
  • you now have a prediction for each point in the training data. Compute the RMSE or other target metric on this set.
  • repeat for all values of lambda (100, by default)
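
Here is a rough sketch of the procedure for a single value of lambda, written out by hand; cv.glmnet does this internally, and far more efficiently, across the whole sequence of lambda values at once. The value lambda = 0.01 below is an arbitrary choice for illustration:

fold <- sample(rep(1:10, length.out = nrow(X_train)))       # assign each observation to a bucket
pred <- rep(NA_character_, nrow(X_train))
for (k in 1:10) {
  fit <- glmnet(X_train[fold != k, ], y_train[fold != k],
                alpha = 0.9, family = "multinomial", lambda = 0.01)
  pred[fold == k] <- predict(fit, newx = X_train[fold == k, ],
                             type = "class", s = 0.01)       # predict the held-out bucket
}
mean(pred == y_train)                                        # cross-validated classification rate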

The “best” lambda is the one that minimizes the target metric from cross-validation. A final model is then built using all of the training data, and that is the one that we get as an output.
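
That final, full-data fit is stored inside the cross-validated object in the glmnet.fit component; coef and predict on a cv.glmnet object use it behind the scenes, so you rarely need to touch it directly:

model$glmnet.fit   # the elastic net fit on all of the training data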

Let’s visualize this by including the player name variable, which gives us a model matrix with a large number of columns.

mf <- model.frame(fgm ~ . -1, data = select(nba, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[nba$train_id == "train",]
y_train <- y[nba$train_id == "train"]

dim(X)
## [1] 20000   290

And fit the cross-validated elastic net model:

model <- cv.glmnet(X_train, y_train, alpha = 0.9, family = "multinomial")

Plotting the model visualizes the cross-validation error for each value of lambda (each dot is the mean cross-validation error and the error bars extend one standard error above and below it):

plot(model)

We can see the “best” lambda, as well as the other lambda marked by the second dashed line in the plot above, with this code:

c(lambda.min = log(model$lambda.min), lambda.1se = log(model$lambda.1se))
## lambda.min lambda.1se 
##  -5.033121  -4.381885

The second lambda gives a model whose cross-validation error is within one standard error of the “best” model, but which often has significantly fewer variables in it. We can see that here:

temp <- coef(model, s = model$lambda.min)
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 69 x 2 sparse Matrix of class "dgCMatrix"
##                                            0             1
## (Intercept)                    -0.9064295163  0.9064295163
## shot_clock                     -0.0085929490  0.0085929490
## touch_time                      0.0133934741 -0.0133934741
## shot_dist                       0.0270490071 -0.0270490071
## close_def_dist                 -0.0355963700  0.0355963700
## shooter_height                  0.0010819269 -0.0010819269
## defender_height                 0.0083705000 -0.0083705000
## player_nameAllen Crabbe        -0.2772686794  0.2772686794
## player_namealonzo gee           0.0590682504 -0.0590682504
## player_nameAmar'e Stoudemire   -0.0249218788  0.0249218788
## player_nameandre drummond       0.0417509683 -0.0417509683
## player_nameanthony davis       -0.1199673167  0.1199673167
## player_namebeno udrih          -0.2077272512  0.2077272512
## player_namebojan bogdanovic     0.0951048641 -0.0951048641
## player_namebrandon bass         0.1078846795 -0.1078846795
## player_namebrandon jennings    -0.0102359954  0.0102359954
## player_namecaron butler        -0.0107015820  0.0107015820
## player_namechandler parsons     0.0113303283 -0.0113303283
## player_namechase budinger       0.1282246727 -0.1282246727
## player_namechris copeland       0.0602635073 -0.0602635073
## player_namechris paul          -0.0896844521  0.0896844521
## player_namecj watson           -0.0925178304  0.0925178304
## player_namecody zeller          0.0553386986 -0.0553386986
## player_namedeandre jordan      -0.0202268907  0.0202268907
## player_namedraymond green       0.0365233558 -0.0365233558
## player_namedwight howard        0.0564231126 -0.0564231126
## player_nameed davis             0.0388712789 -0.0388712789
## player_nameelfrid payton        0.1519223215 -0.1519223215
## player_nameenes kanter          0.0080794288 -0.0080794288
## player_namegerald henderson     0.0650161868 -0.0650161868
## player_namegreivis vasquez      0.1455280993 -0.1455280993
## player_namehedo turkoglu       -0.0127615440  0.0127615440
## player_nameisaiah thomas       -0.1240460309  0.1240460309
## player_namejames johnson       -0.0403155621  0.0403155621
## player_namejason thompson       0.0586974294 -0.0586974294
## player_namejoakim noah          0.0349944342 -0.0349944342
## player_namejohn wall           -0.0872465054  0.0872465054
## player_namejose calderon       -0.0721927306  0.0721927306
## player_namejusuf nurkic         0.0806380624 -0.0806380624
## player_namekendrick perkins     0.0286335154 -0.0286335154
## player_namekj mcdaniels         0.0005850154 -0.0005850154
## player_namekostas papanikolaou  0.0336103873 -0.0336103873
## player_namekyle korver         -0.0571954312  0.0571954312
## player_namekyle lowry          -0.0190762327  0.0190762327
## player_nameluis scola          -0.0171709513  0.0171709513
## player_nameluol deng            0.0363349745 -0.0363349745
## player_namemarco belinelli     -0.2494006885  0.2494006885
## player_namemarcus morris       -0.1069566927  0.1069566927
## player_namemarreese speights   -0.0599150860  0.0599150860
## player_namenate robinson        0.1975586614 -0.1975586614
## player_namenerlens noel         0.0423555992 -0.0423555992
## player_namenikola mirotic       0.1541470498 -0.1541470498
## player_namepau gasol            0.1388953120 -0.1388953120
## player_namepj tucker            0.0046268126 -0.0046268126
## player_nameramon sessions       0.3755329592 -0.3755329592
## player_namerasual butler       -0.1059906160  0.1059906160
## player_nameray mccallum         0.0943366948 -0.0943366948
## player_namerobbie hummel        0.0195621226 -0.0195621226
## player_nameshabazz muhammad     0.0861710487 -0.0861710487
## player_nameshabazz napier       0.0086892870 -0.0086892870
## player_namestephen curry       -0.1472887743  0.1472887743
## player_namesteven adams         0.0193122661 -0.0193122661
## player_namethabo sefolosha      0.3540326755 -0.3540326755
## player_namethaddeus young       0.1493818612 -0.1493818612
## player_nametim duncan          -0.0780314324  0.0780314324
## player_nametobias harris       -0.0592519300  0.0592519300
## player_nametyreke evans         0.0078478446 -0.0078478446
## player_nametyson chandler      -0.1450126013  0.1450126013
## player_namezaza pachulia        0.2116744750 -0.2116744750
temp <- coef(model, s = model$lambda.1se)
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 12 x 2 sparse Matrix of class "dgCMatrix"
##                                       0            1
## (Intercept)                -0.433480391  0.433480391
## shot_clock                 -0.008240871  0.008240871
## touch_time                  0.008436890 -0.008436890
## shot_dist                   0.022946939 -0.022946939
## close_def_dist             -0.025263578  0.025263578
## defender_height             0.003795927 -0.003795927
## player_namebeno udrih      -0.011852240  0.011852240
## player_namemarco belinelli -0.023315290  0.023315290
## player_namepau gasol        0.016617682 -0.016617682
## player_nameramon sessions   0.050269571 -0.050269571
## player_namestephen curry   -0.014096546  0.014096546
## player_namethabo sefolosha  0.026663558 -0.026663558

Often, the model that performs best on new data sets is the one fit with the lambda whose cross-validation loss is one standard error above the minimum value. It is also much easier to interpret its results.
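
As a sanity check, you can compare the two choices of lambda directly on the validation set by reusing the prediction pattern from above; the exact rates will vary with the random assignment of observations to folds:

nba %>%
  mutate(
    pred_min = predict(model, newx = X, type = "class", s = "lambda.min"),
    pred_1se = predict(model, newx = X, type = "class", s = "lambda.1se")
  ) %>%
  group_by(train_id) %>%
  summarize(
    class_rate_min = mean(pred_min == fgm),
    class_rate_1se = mean(pred_1se == fgm)
  )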