I needed to make some further changes to tmodels. Reinstall with the following code:

# devtools::install_github("statsmaths/tmodels")

Then, read in all of the R packages that we will need for today:

library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)

NBA Dataset

Today we are going to look at a dataset from the NBA. Specifically, we will try to see what factors influence whether a shot is made or missed. Run the following code to load the dataset. Do not worry about what the extra lines of code are doing; I am just cleaning the dataset to make it easier to model. You will learn how to do that in the second half of the course.

nba <- read_csv("https://statsmaths.github.io/ml_data/nba_shots.csv")
nba <- filter(nba, !is.na(fgm))
nba <- mutate(nba, shot = if_else(fgm == 0, "Missed", "Made"), period = sprintf("p%d", period), pts_type = if_else(pts_type == 2, "two", "three"))
nba <- select(nba, shot, pts_type, period, shot_clock, dribbles, touch_time, shot_dist, close_def_dist, shooter_height)

You can see the data with this command:

nba
## # A tibble: 16,000 x 9
##    shot  pts_type period shot_clock dribbles touch_time shot_dist
##    <chr> <chr>    <chr>       <dbl>    <dbl>      <dbl>     <dbl>
##  1 Miss… two      p2           18.6        0        0.8      23.7
##  2 Miss… three    p4           11          0        1.2      26.7
##  3 Miss… two      p3            4.9        2        1.8       9.4
##  4 Miss… two      p4            9.9        4        3.9      14.6
##  5 Miss… three    p2           24          2        4.2      35.2
##  6 Miss… three    p3           19.8        0        0.7      22.7
##  7 Miss… two      p4           11.6        2        3.2       3.9
##  8 Miss… three    p1            8.1        0        0.9      22.6
##  9 Made  two      p3            3.6        3        3.2      18.1
## 10 Miss… two      p1           22.9        0        1.4      21.7
## # … with 15,990 more rows, and 2 more variables: close_def_dist <dbl>,
## #   shooter_height <dbl>
  1. Let’s first see how well players make shots depending on whether the attempt is a two-point or a three-point shot. Build a contingency table of the variables shot (response) and pts_type:
tmod_contingency(shot ~ pts_type, data = nba)
##          Response
## Predictor Made Missed
##     three 1615   2567
##     two   6331   5487
  2. I want you to use one of the contingency table tests that we had for the first exam on the table of shots as a function of points type. Pick the test that seems the most appropriate and run it below:
tmod_z_test_prop(shot ~ pts_type, data = nba)
## 
## Z-Test for Equality of Proportions (2 groups)
## 
##  H0: true probability of effect is the same between groups
##  HA: true probability of effect is different between groups
## 
##  Test statistic: Z = 16.96
##  P-value: < 2.2e-16
## 
##  Parameter: Pr(Missed|three) - Pr(Missed|two)
##  Point estimate: 0.14953
##  Confidence interval: [0.13225, 0.16681]

Is the test significant? What is the point estimate and what does it mean? Does the sign of the point estimate (negative or positive) make sense?

Yes, it is significant (p-value < 2.2e-16). The point estimate is Pr(Missed|three) - Pr(Missed|two), the increase in the probability of missing a three-point shot compared to a two-point shot; equivalently, it is the increased probability of making a two-point shot. It is positive, which makes sense because two-point shots are easier to make.
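
As a rough check, the reported figures can be reproduced in base R from the four counts in the contingency table. The sketch below assumes an unpooled (Wald) standard error for the difference in proportions; that choice is my assumption, not something the tmodels output states, although it does agree with the numbers printed above. The variable names are mine.

```r
# Counts copied from the contingency table above
missed <- c(three = 2567, two = 5487)
total  <- c(three = 1615 + 2567, two = 6331 + 5487)

# Probability of a miss within each shot type
p_miss <- missed / total

# Point estimate: Pr(Missed|three) - Pr(Missed|two)
est <- unname(p_miss["three"] - p_miss["two"])

# Assumed unpooled standard error, test statistic, and 95% interval
se <- sqrt(sum(p_miss * (1 - p_miss) / total))
z  <- est / se
ci <- est + c(-1, 1) * qnorm(0.975) * se

print(round(c(estimate = est, z = z, lower = ci[1], upper = ci[2]), 4))
```

Working from the counts makes it clear that the point estimate is simply the gap between two miss rates, about 61% for threes and about 46% for twos.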

  3. Re-do the analysis with logistic regression.
tmod_logistic_regression(shot ~ pts_type, data = nba)
## 
## Logistic regression; Z-Test
## 
##  H0: Difference in conditional log odds is zero
##  HA: Difference in conditional log odds is non-zero
## 
##  Test statistic: Z = -16.513
##  P-value: < 2.2e-16
## 
##  Parameter: LO(shot=Missed|two) - LO(shot=Missed|three)
##             after controlling for -- 
##  Point estimate: -0.60648
##  Confidence interval: [-0.67847, -0.53449]

You should see that the logistic regression flips the classes due to the internal logic of the tmodels package. Notice that the point estimate is different from the one in the Z-test, and make sure you understand why we would not expect these numbers to be the same.

It is different because the point estimate here measures the change (a decrease here) in the log-odds of a missed shot rather than the change in the probability itself.
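
To see the two scales side by side, here is a small base R sketch that refits the same model from the four counts in the contingency table. It uses `glm` rather than tmodels, and the data frame `tab` and the other names are mine; treat it as an illustration of the log-odds scale, not as a description of what tmodels does internally.

```r
# Counts copied from the contingency table above
tab <- data.frame(
  pts_type = factor(c("three", "two")),
  made     = c(1615, 6331),
  missed   = c(2567, 5487)
)

# Grouped logistic regression: a "success" here is a missed shot
fit <- glm(cbind(missed, made) ~ pts_type, data = tab, family = binomial)
coef_two <- unname(coef(fit)["pts_typetwo"])

# The same quantity by hand: LO(Missed|two) - LO(Missed|three)
lo_missed <- log(tab$missed / tab$made)
by_hand   <- lo_missed[2] - lo_missed[1]

# Back on the probability scale, the gap is the difference in proportions
p_missed <- plogis(lo_missed)
prob_gap <- p_missed[2] - p_missed[1]

print(round(c(glm = coef_two, by_hand = by_hand, prob_gap = prob_gap), 4))
```

The coefficient is a difference of log-odds (about -0.61), while the same counts give a difference of probabilities of about -0.15 in the same direction, so the two point estimates describe the same effect in different units.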

  4. Instead of using only the shot type, let’s now use the variable shot_dist and see how it affects the shot outcome. You could use these two variables with a two-sample T-test like this:
tmod_t_test(shot_dist ~ shot, data = nba)
## 
## Two Sample t-test
## 
##  H0: Difference in true means is zero
##  HA: Difference in true means is non-zero
## 
##  Test statistic: t(15991) = -25.54
##  P-value: < 2.2e-16
## 
##  Parameter: Difference in means (Made - Missed)
##  Point estimate: -3.5366
##  Confidence interval: [-3.8080, -3.2651]

Explain why this test is not appropriate for our application.

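It is not appropriate because the test treats shot distance as the response and the shot outcome as the grouping variable, as if making or missing the shot determined the distance. We want the other direction: how shot distance influences whether the shot is made.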

  5. Use logistic regression to instead investigate how the likelihood of a shot being made changes with shot length:
tmod_logistic_regression(shot ~ shot_dist, data = nba)
## 
## Logistic regression; Z-Test
## 
##  H0: Change in conditional log odds is zero
##  HA: Change in conditional log odds is non-zero
## 
##  Test statistic: Z = 24.78
##  P-value: < 2.2e-16
## 
##  Parameter: change in LO(shot=Missed) for unit change in shot_dist
##             controlling for -- 
##  Point estimate: 0.04526
##  Confidence interval: [0.04168, 0.04884]

Does the sign of the point estimate (negative or positive) make sense given what we are predicting here?

The point estimate is positive, indicating that the shot is more likely to be missed as we move away from the basket. This seems reasonable because longer shots are more difficult to make.
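
One way to make the size of this estimate more concrete is to convert it from the log-odds scale to an odds ratio. A small sketch using only the point estimate and confidence interval reported above (the variable names are mine):

```r
# Point estimate and confidence interval copied from the output above
est <- 0.04526
ci  <- c(0.04168, 0.04884)

# exp() turns a change in log-odds into a multiplicative change in odds
odds_ratio    <- exp(est)       # per one unit of shot_dist
odds_ratio_ci <- exp(ci)
odds_ratio_10 <- exp(10 * est)  # per ten units of shot_dist

print(round(c(per_unit = odds_ratio, per_ten_units = odds_ratio_10), 3))
print(round(odds_ratio_ci, 3))
```

Each additional unit of shot distance multiplies the odds of a miss by about 1.046, which compounds to roughly 1.57 over ten units.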

  6. Run a logistic regression that predicts whether a shot was made based on the distance to the closest defender.
tmod_logistic_regression(shot ~ close_def_dist, data = nba)
## 
## Logistic regression; Z-Test
## 
##  H0: Change in conditional log odds is zero
##  HA: Change in conditional log odds is non-zero
## 
##  Test statistic: Z = 2.0721
##  P-value: 0.03826
## 
##  Parameter: change in LO(shot=Missed) for unit change in close_def_dist
##             controlling for -- 
##  Point estimate: 0.011757
##  Confidence interval: [0.00063604, 0.02287700]

You should see that the p-value is less than 0.05 but not very small (larger than 0.01, for example). Does the sign of the point estimate make sense to you?

The point estimate is positive, meaning that the probability of shot=Missed increases as the distance to the defender increases. This seems a bit counterintuitive at first; you would think that a closer defender would make it harder to make a shot, not easier. Look to the next question for an explanation.

  7. Re-run the logistic regression from question 6, but include shot distance as a nuisance variable:
tmod_logistic_regression(shot ~ close_def_dist + shot_dist, data = nba)
## 
## Logistic regression; Z-Test
## 
##  H0: Change in conditional log odds is zero
##  HA: Change in conditional log odds is non-zero
## 
##  Test statistic: Z = -13.09
##  P-value: < 2.2e-16
## 
##  Parameter: change in LO(shot=Missed) for unit change in close_def_dist
##             controlling for -- shot_dist
##  Point estimate: -0.097292
##  Confidence interval: [-0.111860, -0.082724]

How does this nuisance variable change the p-value and the point estimate? Can you explain / summarize what might be going on here?

The p-value is now very small and the sign of the point estimate has flipped. After accounting for shot distance, the distance of the defender now works the way we would expect: a closer defender makes it harder to make a basket. Putting this together, the univariate analysis was confounded by the shot length; shots that were far from defenders also tended to be long shots far from the basket. That is why they looked hard even though there were no nearby defenders.
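
The same kind of sign flip can be reproduced on simulated data, which may help build intuition for confounding. Everything in the sketch below is made up for illustration: the sample size, the coefficients, and the way defender distance grows with shot distance are assumptions of mine, not estimates from the NBA data. It uses base R `glm` rather than tmodels.

```r
set.seed(1)
n <- 20000

# Made-up data: defenders tend to stand farther away on longer shots
shot_dist      <- runif(n, min = 0, max = 30)
close_def_dist <- 0.3 * shot_dist + rexp(n, rate = 1)

# Made-up truth: distance makes a miss more likely, open space makes it less likely
log_odds_miss <- -1 + 0.08 * shot_dist - 0.10 * close_def_dist
missed <- rbinom(n, size = 1, prob = plogis(log_odds_miss))

sim <- data.frame(missed, shot_dist, close_def_dist)

# Defender distance alone, then defender distance controlling for shot distance
fit_alone    <- glm(missed ~ close_def_dist, data = sim, family = binomial)
fit_adjusted <- glm(missed ~ close_def_dist + shot_dist, data = sim, family = binomial)

b_alone    <- unname(coef(fit_alone)["close_def_dist"])
b_adjusted <- unname(coef(fit_adjusted)["close_def_dist"])
print(c(alone = b_alone, adjusted = b_adjusted))
```

In this simulation the coefficient on close_def_dist is positive when it is the only predictor and negative once shot_dist is included, because in the first model defender distance is partly standing in for shot distance.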

  8. A friend claims that, on average, the shot clock (time left for a player to attempt a shot) is halfway finished when a shot is taken. The NBA shot clock lasts 24 seconds. Run a one-sample T-test to test this hypothesis:
tmod_t_test_one(shot_clock ~ 1, data = nba, h0 = 12)
## 
## One Sample t-test
## 
##  H0: Mean is equal to 12.000000
##  HA: Mean is not equal to 12.000000
## 
##  Test statistic: t(15999) = 5.368
##  P-value: 8.072e-08
## 
##  Parameter: Average value
##  Point estimate: 12.248
##  Confidence interval: [12.157, 12.338]

Does the p-value suggest that your friend is likely incorrect? Looking at the point estimate, however, would it be fair to say that your friend is completely mistaken?

The p-value is significant (p = 8.1e-08); however, the confidence interval runs from 12.2 to 12.3 seconds, so the hypothesized value of 12 seconds is not very far away in practical terms. This is what we refer to as a statistically significant result that is not practically significant. Generally, we need both to hold to say that we have an interesting effect.
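
To put numbers on the gap between statistical and practical significance, the sketch below works only from the values printed in the t-test output; nothing is recomputed from the raw data, and the variable names are mine. Because the inputs are rounded, the rebuilt interval may differ from the printed one in the last digit.

```r
# Values copied from the t-test output above
est    <- 12.248   # sample mean of shot_clock
h0     <- 12       # hypothesized mean
t_stat <- 5.368
df     <- 15999

# Back out the standard error, then rebuild the 95% confidence interval
se <- (est - h0) / t_stat
ci <- est + c(-1, 1) * qt(0.975, df) * se

# Size of the gap in seconds and as a share of the 24-second clock
gap_seconds <- est - h0
gap_share   <- gap_seconds / 24

print(round(ci, 3))
print(round(c(gap_seconds = gap_seconds, gap_share = gap_share), 3))
```

The gap is about a quarter of a second, roughly 1% of the shot clock. With 16,000 shots the standard error is small enough that even a gap this size is detected.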