Today’s lab looks at a data set of reported crimes from the city of Chicago (if you were in my 289 class last semester, this should look familiar). Our prediction task is to predict the type of crime based on the features of the reported incident. To start, we’ll look at a data set that has three crime types: “battery”, “criminal damage”, and “theft”.
set.seed(1)
chicago <- read_csv(file.path("data", "chi_crimes_3.csv")) %>%
mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
chicago
## # A tibble: 30,000 x 9
## crime_type year month day hour location latitude longitude train_id
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 battery 2001 11 17 18 STREET 41.9 -87.6 valid
## 2 battery 2005 8 6 11 ALLEY 41.7 -87.6 valid
## 3 criminal da… 2003 7 9 15 SMALL RETAI… 41.9 -87.6 train
## 4 theft 2012 3 27 22 RESIDENCE 41.7 -87.6 train
## 5 criminal da… 2003 9 19 12 HOSPITAL BU… 41.8 -87.6 train
## 6 criminal da… 2003 1 16 21 OTHER 41.8 -87.6 valid
## 7 theft 2012 10 19 19 RESIDENCE-G… 41.9 -87.7 train
## 8 criminal da… 2009 8 24 22 SIDEWALK 41.6 -87.5 valid
## 9 battery 2013 10 7 20 STREET 41.7 -87.6 train
## 10 criminal da… 2011 1 12 24 SCHOOL, PUB… 41.9 -87.8 valid
## # … with 29,990 more rows
Start by creating a model matrix, response vector, and training model matrix and response vector using all of the variables in the data set other than the train_id
. This will be fairly large due to the large number of options for the variable location
.
mf <- model.frame(crime_type ~ . -1,
data = select(chicago, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)
X_train <- X[chicago$train_id == "train",]
y_train <- y[chicago$train_id == "train"]
Now, fit a cross-validated elastic net model using 3 folds and alpha equal to 0.9.
model <- cv.glmnet(
X_train, y_train, alpha = 0.9, family = "multinomial", nfolds = 3
)
Next, plot the model using the plot
function. Take note of the shape of the curve and the number of included variables (the numbers of the top of the plot) for each value of lambda.
plot(model)
And compute the classification rate for the data set.
chicago %>%
mutate(pred = predict(model, newx = X, type = "class")) %>%
group_by(train_id) %>%
summarize(class_rate = mean(pred == crime_type))
## # A tibble: 2 x 2
## train_id class_rate
## <chr> <dbl>
## 1 train 0.573
## 2 valid 0.583
How does this compare to random guessing (the data are balanced, with an equal number of crimes of each type)? Answer: Random guesing would give an error rate of 33%. Here, we have a rate of 58%. So, it still makes a lot of mistakes, but much better than guessing at random.
Now, build a confusion matrix of the data.
chicago %>%
mutate(pred = predict(model, newx = X, type = "class")) %>%
select(pred, crime_type, train_id) %>%
table()
## , , train_id = train
##
## crime_type
## pred battery criminal damage theft
## battery 4186 1034 1336
## criminal damage 698 2970 1506
## theft 1156 1999 3195
##
## , , train_id = valid
##
## crime_type
## pred battery criminal damage theft
## battery 2753 671 848
## criminal damage 452 2051 964
## theft 755 1275 2151
Focusing on the validation set, are some crimes harder to distinguish than others? Answer Theft and criminal damage seem a bit harder to tell apart than it is distinguish them from battery.
Now, print out the coefficients for a value of lambda that selects about a dozen variables (perhaps try the 21st value of lambda).
temp <- coef(model, s = model$lambda[22])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 17 x 3 sparse Matrix of class "dgCMatrix"
## battery criminal damage theft
## (Intercept) -9.45473621 18.2320041 -8.7772679
## hour 0.02387677 . .
## locationALLEY 1.37446481 . .
## locationAPARTMENT . . 0.5997676
## locationCHA HALLWAY/STAIRWELL/ELEVATOR 0.65924022 . .
## locationCHA PARKING LOT/GROUNDS 1.83012953 . .
## locationDEPARTMENT STORE . 1.3660471 .
## locationDRUG STORE . 0.3926257 .
## locationGROCERY FOOD STORE . 1.1217340 .
## locationPOLICE FACILITY/VEH PARKING LOT 0.19075451 . .
## locationRESIDENCE -0.05973570 . 0.5760911
## locationRESIDENCE-GARAGE . . 0.8696973
## locationSIDEWALK 2.30905304 . .
## locationSMALL RETAIL STORE . 0.6189959 .
## locationSTREET 0.18463824 -0.2967924 .
## latitude . 1.0842607 .
## longitude . 0.8235652 .
Looking at the selected variables, what locations seem to be most associated with each type of crime? Answer Battery seems to take place in public places such as ALLEY, SIDEWALK, and PARKING LOT. Criminal damage takes place in stores. Theft is more likely in places of residence.
Now, refit the model using alpha equal to 0.1.
model <- cv.glmnet(
X_train, y_train, alpha = 0.1, family = "multinomial", nfolds = 3
)
Now, plot the model.
plot(model)
Verify that more non-zero components are present at the minimul value of the CV-curve.
Let’s now load another version of the crimes data, this time with 12 different categories. The data set is large, so in the interest of time we will only use 5% of it for the training set to speed things along.
set.seed(1)
chicago12 <- read_csv(file.path("data", "chi_crimes_12.csv")) %>%
mutate(train_id = if_else(runif(n()) < 0.05, "train", "valid"))
chicago12
## # A tibble: 120,000 x 9
## crime_type year month day hour location latitude longitude train_id
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 robbery 2013 8 20 9 COMMERCIAL … 42.0 -87.7 valid
## 2 motor vehic… 2001 11 17 18 STREET 41.9 -87.6 valid
## 3 robbery 2006 12 22 10 OTHER 41.9 -87.7 valid
## 4 theft 2012 7 27 17 RESIDENCE P… 42.0 -87.7 valid
## 5 criminal tr… 2002 3 8 2 OTHER 42.0 -87.7 valid
## 6 deceptive p… 2007 2 10 17 STREET 41.9 -87.7 valid
## 7 battery 2015 7 28 9 HOTEL/MOTEL 41.9 -87.6 valid
## 8 other offen… 2005 3 16 7 AIRPORT/AIR… 42.0 -87.9 valid
## 9 other offen… 2004 5 9 6 COMMERCIAL … 42.0 -87.7 valid
## 10 motor vehic… 2005 8 6 11 ALLEY 41.7 -87.6 valid
## # … with 119,990 more rows
Again, build a model matrix and response vector with all of the variables other than train_id
.
mf <- model.frame(crime_type ~ . -1,
data = select(chicago12, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)
X_train <- X[chicago12$train_id == "train",]
y_train <- y[chicago12$train_id == "train"]
We are going to again fit a cross-validated elastic net model with alpha 0.9, and three folds. Add the option trace.it = TRUE
to have a verbose print out of the model progress (it may take a minute or two depending on your machine).
model <- cv.glmnet(X_train, y_train,
alpha = 0.9,
family = "multinomial",
nfolds = 3,
trace.it = FALSE) # you should set to true; but it does
# not show up correctly in the solutions
Plot the model CV curve. Verify that the curve has a steeper bend for small values of lambda compared to the previous model.
plot(model)
Now, determine the classification rate.
chicago12 %>%
mutate(pred = predict(model, newx = X, type = "class")) %>%
group_by(train_id) %>%
summarize(class_rate = mean(pred == crime_type))
## # A tibble: 2 x 2
## train_id class_rate
## <chr> <dbl>
## 1 train 0.300
## 2 valid 0.287
How does this compare to random guessing (the data are balanced, with an equal number of crimes of each type)? Answer: Random guesing would give an error rate of 1/12 = 8.3%. Here, we have a rate of 28%. So, while it misses more often that it is correct, it is more than three times better than random guessing.
As before, produce a confusion matrix of the model:
chicago12 %>%
mutate(pred = predict(model, newx = X, type = "class")) %>%
select(pred, crime_type, train_id) %>%
table()
## , , train_id = train
##
## crime_type
## pred assault battery burglary criminal damage
## assault 169 7 6 13
## battery 6 16 0 2
## burglary 37 70 289 3
## criminal damage 37 102 12 224
## criminal trespass 13 47 12 2
## deceptive practice 12 41 167 0
## motor vehicle theft 6 35 1 0
## narcotics 0 6 12 6
## other offense 53 42 15 33
## prostitution 43 4 3 25
## robbery 142 116 8 164
## theft 6 29 3 3
## crime_type
## pred criminal trespass deceptive practice motor vehicle theft
## assault 21 1 11
## battery 4 0 2
## burglary 93 170 112
## criminal damage 16 0 14
## criminal trespass 136 24 71
## deceptive practice 70 229 90
## motor vehicle theft 71 19 120
## narcotics 5 3 11
## other offense 49 8 43
## prostitution 19 0 3
## robbery 8 3 26
## theft 2 0 10
## crime_type
## pred narcotics other offense prostitution robbery theft
## assault 14 34 29 16 18
## battery 3 2 7 2 6
## burglary 144 40 117 65 74
## criminal damage 79 43 31 94 77
## criminal trespass 18 10 20 13 48
## deceptive practice 67 6 41 30 39
## motor vehicle theft 7 5 15 14 51
## narcotics 24 14 12 8 11
## other offense 40 258 68 39 66
## prostitution 18 20 76 13 26
## robbery 101 60 58 259 89
## theft 12 12 10 2 34
##
## , , train_id = valid
##
## crime_type
## pred assault battery burglary criminal damage
## assault 2768 216 115 279
## battery 133 173 4 45
## burglary 561 1168 5345 36
## criminal damage 694 1938 98 4471
## criminal trespass 252 908 163 77
## deceptive practice 259 727 2929 9
## motor vehicle theft 143 816 92 42
## narcotics 48 135 235 145
## other offense 1131 872 350 681
## prostitution 842 168 42 374
## robbery 2564 1933 82 3276
## theft 81 431 17 90
## crime_type
## pred criminal trespass deceptive practice motor vehicle theft
## assault 391 30 145
## battery 73 1 11
## burglary 2130 3537 2042
## criminal damage 242 20 156
## criminal trespass 2305 535 1565
## deceptive practice 1111 4418 1581
## motor vehicle theft 1570 508 2199
## narcotics 157 13 190
## other offense 905 345 923
## prostitution 357 39 113
## robbery 177 95 379
## theft 88 2 183
## crime_type
## pred narcotics other offense prostitution robbery theft
## assault 315 734 576 343 368
## battery 55 78 128 36 88
## burglary 2521 662 2235 1065 1354
## criminal damage 1443 930 626 1466 1231
## criminal trespass 227 274 412 199 857
## deceptive practice 1317 135 974 671 751
## motor vehicle theft 164 148 281 184 742
## narcotics 339 256 306 126 149
## other offense 747 4148 1281 470 1295
## prostitution 353 543 1431 313 309
## robbery 1827 1287 1012 4485 1673
## theft 165 301 254 87 644
What are some crimes that seem difficult to tell apart on the validation set? Answer: There are many answers here. Battery is usually mis-classified as robbery or burglary. Criminal damage is often classified as robbery. Deceptive practice is often classified as burglary; the same can be said for prostitution.
For each (actual) crime type, compute the error rate for this category on the validation set and arrange the data from the highest classification rate to t