Chicago Crimes

Three Types

Today’s lab looks at a data set of reported crimes from the city of Chicago (if you were in my 289 class last semester, this should look familiar). Our prediction task is to predict the type of crime based on the features of the reported incident. To start, we’ll look at a data set that has three crime types: “battery”, “criminal damage”, and “theft”.

set.seed(1)

chicago <- read_csv(file.path("data", "chi_crimes_3.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
chicago
## # A tibble: 30,000 x 9
##    crime_type    year month   day  hour location     latitude longitude train_id
##    <chr>        <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>     <dbl> <chr>   
##  1 battery       2001    11    17    18 STREET           41.9     -87.6 valid   
##  2 battery       2005     8     6    11 ALLEY            41.7     -87.6 valid   
##  3 criminal da…  2003     7     9    15 SMALL RETAI…     41.9     -87.6 train   
##  4 theft         2012     3    27    22 RESIDENCE        41.7     -87.6 train   
##  5 criminal da…  2003     9    19    12 HOSPITAL BU…     41.8     -87.6 train   
##  6 criminal da…  2003     1    16    21 OTHER            41.8     -87.6 valid   
##  7 theft         2012    10    19    19 RESIDENCE-G…     41.9     -87.7 train   
##  8 criminal da…  2009     8    24    22 SIDEWALK         41.6     -87.5 valid   
##  9 battery       2013    10     7    20 STREET           41.7     -87.6 train   
## 10 criminal da…  2011     1    12    24 SCHOOL, PUB…     41.9     -87.8 valid   
## # … with 29,990 more rows

Start by creating a model matrix, response vector, and training model matrix and response vector using all of the variables in the data set other than the train_id. This will be fairly large due to the large number of options for the variable location.

mf <- model.frame(crime_type ~ . -1,
                  data = select(chicago, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[chicago$train_id == "train",]
y_train <- y[chicago$train_id == "train"]

Now, fit a cross-validated elastic net model using 3 folds and alpha equal to 0.9.

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, family = "multinomial", nfolds = 3
)

Next, plot the model using the plot function. Take note of the shape of the curve and the number of included variables (the numbers of the top of the plot) for each value of lambda.

plot(model)

And compute the classification rate for the data set.

chicago %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == crime_type))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.573
## 2 valid         0.583

How does this compare to random guessing (the data are balanced, with an equal number of crimes of each type)? Answer: Random guesing would give an error rate of 33%. Here, we have a rate of 58%. So, it still makes a lot of mistakes, but much better than guessing at random.

Now, build a confusion matrix of the data.

chicago %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  select(pred, crime_type, train_id) %>%
  table()
## , , train_id = train
## 
##                  crime_type
## pred              battery criminal damage theft
##   battery            4186            1034  1336
##   criminal damage     698            2970  1506
##   theft              1156            1999  3195
## 
## , , train_id = valid
## 
##                  crime_type
## pred              battery criminal damage theft
##   battery            2753             671   848
##   criminal damage     452            2051   964
##   theft               755            1275  2151

Focusing on the validation set, are some crimes harder to distinguish than others? Answer Theft and criminal damage seem a bit harder to tell apart than it is distinguish them from battery.

Now, print out the coefficients for a value of lambda that selects about a dozen variables (perhaps try the 21st value of lambda).

temp <- coef(model, s = model$lambda[22])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 17 x 3 sparse Matrix of class "dgCMatrix"
##                                             battery criminal damage      theft
## (Intercept)                             -9.45473621      18.2320041 -8.7772679
## hour                                     0.02387677       .          .        
## locationALLEY                            1.37446481       .          .        
## locationAPARTMENT                        .                .          0.5997676
## locationCHA HALLWAY/STAIRWELL/ELEVATOR   0.65924022       .          .        
## locationCHA PARKING LOT/GROUNDS          1.83012953       .          .        
## locationDEPARTMENT STORE                 .                1.3660471  .        
## locationDRUG STORE                       .                0.3926257  .        
## locationGROCERY FOOD STORE               .                1.1217340  .        
## locationPOLICE FACILITY/VEH PARKING LOT  0.19075451       .          .        
## locationRESIDENCE                       -0.05973570       .          0.5760911
## locationRESIDENCE-GARAGE                 .                .          0.8696973
## locationSIDEWALK                         2.30905304       .          .        
## locationSMALL RETAIL STORE               .                0.6189959  .        
## locationSTREET                           0.18463824      -0.2967924  .        
## latitude                                 .                1.0842607  .        
## longitude                                .                0.8235652  .

Looking at the selected variables, what locations seem to be most associated with each type of crime? Answer Battery seems to take place in public places such as ALLEY, SIDEWALK, and PARKING LOT. Criminal damage takes place in stores. Theft is more likely in places of residence.

Now, refit the model using alpha equal to 0.1.

model <- cv.glmnet(
  X_train, y_train, alpha = 0.1, family = "multinomial", nfolds = 3
)

Now, plot the model.

plot(model)

Verify that more non-zero components are present at the minimul value of the CV-curve.

Twelve Types

Let’s now load another version of the crimes data, this time with 12 different categories. The data set is large, so in the interest of time we will only use 5% of it for the training set to speed things along.

set.seed(1)

chicago12 <- read_csv(file.path("data", "chi_crimes_12.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.05, "train", "valid"))
chicago12
## # A tibble: 120,000 x 9
##    crime_type    year month   day  hour location     latitude longitude train_id
##    <chr>        <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>     <dbl> <chr>   
##  1 robbery       2013     8    20     9 COMMERCIAL …     42.0     -87.7 valid   
##  2 motor vehic…  2001    11    17    18 STREET           41.9     -87.6 valid   
##  3 robbery       2006    12    22    10 OTHER            41.9     -87.7 valid   
##  4 theft         2012     7    27    17 RESIDENCE P…     42.0     -87.7 valid   
##  5 criminal tr…  2002     3     8     2 OTHER            42.0     -87.7 valid   
##  6 deceptive p…  2007     2    10    17 STREET           41.9     -87.7 valid   
##  7 battery       2015     7    28     9 HOTEL/MOTEL      41.9     -87.6 valid   
##  8 other offen…  2005     3    16     7 AIRPORT/AIR…     42.0     -87.9 valid   
##  9 other offen…  2004     5     9     6 COMMERCIAL …     42.0     -87.7 valid   
## 10 motor vehic…  2005     8     6    11 ALLEY            41.7     -87.6 valid   
## # … with 119,990 more rows

Again, build a model matrix and response vector with all of the variables other than train_id.

mf <- model.frame(crime_type ~ . -1,
                  data = select(chicago12, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[chicago12$train_id == "train",]
y_train <- y[chicago12$train_id == "train"]

We are going to again fit a cross-validated elastic net model with alpha 0.9, and three folds. Add the option trace.it = TRUE to have a verbose print out of the model progress (it may take a minute or two depending on your machine).

model <- cv.glmnet(X_train, y_train,
                   alpha = 0.9,
                   family = "multinomial",
                   nfolds = 3,
                   trace.it = FALSE)  # you should set to true; but it does
                                      # not show up correctly in the solutions

Plot the model CV curve. Verify that the curve has a steeper bend for small values of lambda compared to the previous model.

plot(model)

Now, determine the classification rate.

chicago12 %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == crime_type))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.300
## 2 valid         0.287

How does this compare to random guessing (the data are balanced, with an equal number of crimes of each type)? Answer: Random guesing would give an error rate of 1/12 = 8.3%. Here, we have a rate of 28%. So, while it misses more often that it is correct, it is more than three times better than random guessing.

As before, produce a confusion matrix of the model:

chicago12 %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  select(pred, crime_type, train_id) %>%
  table()
## , , train_id = train
## 
##                      crime_type
## pred                  assault battery burglary criminal damage
##   assault                 169       7        6              13
##   battery                   6      16        0               2
##   burglary                 37      70      289               3
##   criminal damage          37     102       12             224
##   criminal trespass        13      47       12               2
##   deceptive practice       12      41      167               0
##   motor vehicle theft       6      35        1               0
##   narcotics                 0       6       12               6
##   other offense            53      42       15              33
##   prostitution             43       4        3              25
##   robbery                 142     116        8             164
##   theft                     6      29        3               3
##                      crime_type
## pred                  criminal trespass deceptive practice motor vehicle theft
##   assault                            21                  1                  11
##   battery                             4                  0                   2
##   burglary                           93                170                 112
##   criminal damage                    16                  0                  14
##   criminal trespass                 136                 24                  71
##   deceptive practice                 70                229                  90
##   motor vehicle theft                71                 19                 120
##   narcotics                           5                  3                  11
##   other offense                      49                  8                  43
##   prostitution                       19                  0                   3
##   robbery                             8                  3                  26
##   theft                               2                  0                  10
##                      crime_type
## pred                  narcotics other offense prostitution robbery theft
##   assault                    14            34           29      16    18
##   battery                     3             2            7       2     6
##   burglary                  144            40          117      65    74
##   criminal damage            79            43           31      94    77
##   criminal trespass          18            10           20      13    48
##   deceptive practice         67             6           41      30    39
##   motor vehicle theft         7             5           15      14    51
##   narcotics                  24            14           12       8    11
##   other offense              40           258           68      39    66
##   prostitution               18            20           76      13    26
##   robbery                   101            60           58     259    89
##   theft                      12            12           10       2    34
## 
## , , train_id = valid
## 
##                      crime_type
## pred                  assault battery burglary criminal damage
##   assault                2768     216      115             279
##   battery                 133     173        4              45
##   burglary                561    1168     5345              36
##   criminal damage         694    1938       98            4471
##   criminal trespass       252     908      163              77
##   deceptive practice      259     727     2929               9
##   motor vehicle theft     143     816       92              42
##   narcotics                48     135      235             145
##   other offense          1131     872      350             681
##   prostitution            842     168       42             374
##   robbery                2564    1933       82            3276
##   theft                    81     431       17              90
##                      crime_type
## pred                  criminal trespass deceptive practice motor vehicle theft
##   assault                           391                 30                 145
##   battery                            73                  1                  11
##   burglary                         2130               3537                2042
##   criminal damage                   242                 20                 156
##   criminal trespass                2305                535                1565
##   deceptive practice               1111               4418                1581
##   motor vehicle theft              1570                508                2199
##   narcotics                         157                 13                 190
##   other offense                     905                345                 923
##   prostitution                      357                 39                 113
##   robbery                           177                 95                 379
##   theft                              88                  2                 183
##                      crime_type
## pred                  narcotics other offense prostitution robbery theft
##   assault                   315           734          576     343   368
##   battery                    55            78          128      36    88
##   burglary                 2521           662         2235    1065  1354
##   criminal damage          1443           930          626    1466  1231
##   criminal trespass         227           274          412     199   857
##   deceptive practice       1317           135          974     671   751
##   motor vehicle theft       164           148          281     184   742
##   narcotics                 339           256          306     126   149
##   other offense             747          4148         1281     470  1295
##   prostitution              353           543         1431     313   309
##   robbery                  1827          1287         1012    4485  1673
##   theft                     165           301          254      87   644

What are some crimes that seem difficult to tell apart on the validation set? Answer: There are many answers here. Battery is usually mis-classified as robbery or burglary. Criminal damage is often classified as robbery. Deceptive practice is often classified as burglary; the same can be said for prostitution.

For each (actual) crime type, compute the error rate for this category on the validation set and arrange the data from the highest classification rate to t