Today, let’s briefly look at another kind of data and see how the techniques we have been working with could be applied to it. Specifically, can we predict whether an NBA player will make a shot:

nba <- read_csv("../data/nba_shots.csv")
nba
## # A tibble: 20,000 × 11
##      fgm period shot_clock dribbles touch_…¹ shot_…² pts_t…³ close…⁴ shoot…⁵
##    <dbl>  <dbl>      <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     0      2       18.6        0      0.8    23.7       2     7.9      78
##  2     0      4       11          0      1.2    26.7       3     2.8      77
##  3     0      3        4.9        2      1.8     9.4       2     3        80
##  4     0      4        2          3      3.1    24.3       2     5.8      75
##  5     0      4        9.9        4      3.9    14.6       2     4.3      73
##  6     1      3       12.7        6      6.7     9.1       2     1.6      77
##  7     0      1        3.1        4      5.9     3.5       2     1.6      83
##  8     0      2       24          2      4.2    35.2       3    28        76
##  9     0      3       19.8        0      0.7    22.7       3     5        74
## 10     0      4       11.6        2      3.2     3.9       2     1.2      78
## # … with 19,990 more rows, 2 more variables: defender_height <dbl>,
## #   player_name <chr>, and abbreviated variable names ¹​touch_time,
## #   ²​shot_dist, ³​pts_type, ⁴​close_def_dist, ⁵​shooter_height

As with most data sets that you will find outside of this class, there is no prepopulated train_id. We can add one with the following code, which ensures that the classes missed (0) and made (1) are evenly distributed.

set.seed(1L)

nba <- nba %>%
  group_by(fgm) %>%
  mutate(rnum = runif(n())) %>%
  mutate(train_id = if_else(rnum > quantile(rnum, 0.6), "train", "valid")) %>%
  select(-rnum) %>%
  ungroup()

Since this data includes predefined features, we could just select a few and build an unpenalized logistic regression model from these. This is nice because we can immediately look at the coefficients and interpret them.

model <- glm(
  fgm ~ shot_dist + shooter_height, data = nba,
  family = binomial(),
  subset = (train_id == "train")
)
summary(model)
## 
## Call:
## glm(formula = fgm ~ shot_dist + shooter_height, family = binomial(), 
##     data = nba, subset = (train_id == "train"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4606  -1.1136   0.1328   1.0847   1.5893  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.203770   0.517730   2.325   0.0201 *  
## shot_dist      -0.044979   0.002641 -17.030   <2e-16 ***
## shooter_height -0.007596   0.006435  -1.180   0.2378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11090  on 7999  degrees of freedom
## Residual deviance: 10784  on 7997  degrees of freedom
## AIC: 10790
## 
## Number of Fisher Scoring iterations: 4

Let’s evaluate the model’s error rate just as we do with our text models:

nba$pred <- predict(model, nba, type = "response")

nba %>%
  mutate(pred_label = if_else(pred > 0.5, 1, 0)) %>%
  group_by(train_id) %>%
  summarize(erate = mean(pred_label != fgm))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.421
## 2 valid    0.424

Next, we can try to fit a penalized model on all of the variables in the data. To do this, we need a few fancy functions from base R to turn the data into a numeric matrix (the real work is only needed to turn the player names into indicator variables; otherwise we could just use as.matrix):

mf <- model.frame(fgm ~ -1 + ., data = select(nba, -train_id, -pred))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

Notice that because of the indicator variables, the size of the model matrix is much larger than the number of variables in the original data:

dim(X)
## [1] 20000   290

Now, split out the training data:

X_train <- X[nba$train_id == "train", ]
y_train <- y[nba$train_id == "train"]

And then apply the penalized logistic regression function:

library(glmnet)
## Loaded glmnet 4.1-6
model <- cv.glmnet(X_train, y_train, family = "binomial", nfolds = 3L)
nba$pred <- as.numeric(predict(model, X, type = "response"))

As with the textual data, we can look at the coefficients:

beta <- coef(model)
beta[as.numeric(beta) != 0,,drop=FALSE]
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)               0.880179912
## shot_clock                0.010617732
## touch_time               -0.020299876
## shot_dist                -0.043536633
## close_def_dist            0.037225786
## defender_height          -0.006609074
## player_namecarl landry   -0.147971110
## player_namejames johnson  0.014459143
## player_namejohn wall      0.389031702

And the error rate of the model, which is only slightly better than the model using just two variables:

nba %>%
  mutate(pred_label = if_else(pred > 0.5, 1, 0)) %>%
  group_by(train_id) %>%
  summarize(erate = mean(pred_label != fgm))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.408
## 2 valid    0.417

Finally, let’s fit a boosted trees model on the data. This requires just a little more setup:

library(xgboost)
X_valid <- X[nba$train_id == "valid", ]
y_valid <- y[nba$train_id == "valid"]

data_train <- xgb.DMatrix(data = X_train, label = y_train)
data_valid <- xgb.DMatrix(data = X_valid, label = y_valid)
watchlist <- list(train=data_train, valid=data_valid)

And then we’ll run it over 400 trees with a small learning rate (0.01):

model <- xgb.train(data = data_train,
                   max_depth = 5,
                   eta = 0.01,
                   nthread = 2,
                   nrounds = 400,
                   objective = "multi:softmax",
                   eval_metric = "mlogloss",
                   watchlist = watchlist,
                   verbose = TRUE,
                   print_every_n = 25,
                   num_class = 2L)
## [1]  train-mlogloss:0.692299 valid-mlogloss:0.692435 
## [26] train-mlogloss:0.675337 valid-mlogloss:0.678891 
## [51] train-mlogloss:0.663577 valid-mlogloss:0.670410 
## [76] train-mlogloss:0.654361 valid-mlogloss:0.665084 
## [101]    train-mlogloss:0.647329 valid-mlogloss:0.661812 
## [126]    train-mlogloss:0.642301 valid-mlogloss:0.659669 
## [151]    train-mlogloss:0.637836 valid-mlogloss:0.658268 
## [176]    train-mlogloss:0.634352 valid-mlogloss:0.657294 
## [201]    train-mlogloss:0.631342 valid-mlogloss:0.656722 
## [226]    train-mlogloss:0.628579 valid-mlogloss:0.656429 
## [251]    train-mlogloss:0.626907 valid-mlogloss:0.656256 
## [276]    train-mlogloss:0.625468 valid-mlogloss:0.656238 
## [301]    train-mlogloss:0.624187 valid-mlogloss:0.656244 
## [326]    train-mlogloss:0.622936 valid-mlogloss:0.656284 
## [351]    train-mlogloss:0.621700 valid-mlogloss:0.656398 
## [376]    train-mlogloss:0.620526 valid-mlogloss:0.656509 
## [400]    train-mlogloss:0.619487 valid-mlogloss:0.656610

The results are, again, only slightly better than the other models:

nba$pred <- predict(model, newdata = X)

nba %>%
  mutate(pred_label = if_else(pred > 0.5, 1, 0)) %>%
  group_by(train_id) %>%
  summarize(erate = mean(pred_label != fgm))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.351
## 2 valid    0.403

Hopefully this helps illustrate how we use the techniques from this class with other data types. It also hopefully illustrates why working with textual data, where the features are not pre-determined, makes for a more interesting study in this class than other data would. Structured data is, of course, very interesting, but much of this is lost unless you are working with data from a domain in which you are knowledgeable and have concrete research questions for.