Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
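The data-loading commands in the next section rely on the tidyverse and sf packages; a minimal setup chunk would look like the following (a sketch, assuming these packages are already installed — the course package that provides the `sm_count()` helper used later would also need to be loaded):

```r
# readr provides read_csv() and read_rds(), dplyr the pipe verbs,
# ggplot2 the plotting functions, and sf the spatial tools (read_sf()).
library(tidyverse)
library(sf)
```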

Chicago Data

This notebook is intended to get you started thinking about how to work with the various Chicago datasets. It, and in particular my posted solutions, should be helpful as you work on your analysis.

Load the Data

Let’s load the data that we will be looking at through the remainder of the semester. By default, the chunk loads a 10% sample of the data. You can load the full dataset by uncommenting the alternative code, but doing so requires increasing the resources available on RStudio Cloud.

comarea <- read_sf(file.path("data", "chicago_community_areas.geojson"))
ziparea <- read_sf(file.path("data", "zip_codes.geojson"))
socio <- read_csv(file.path("data", "census_socioeconomic.csv"))
medical <- read_csv(file.path("data", "chicago_medical_examiner_cases.csv.gz"))
crimes <- read_rds(file.path("data", "chicago_crimes_2001_2020_sample.rds"))
#crimes <- bind_rows(
#  read_csv(file.path("data", "chicago_crimes_2001_2011.csv.gz")),
#  read_csv(file.path("data", "chicago_crimes_2012_2020.csv.gz"))
#)
schools <- read_sf(file.path("data", "chicago_schools.geojson"))
police <- read_sf(file.path("data", "chicago_police_stations.geojson"))

In this notebook, we will explore the individual variables in the data and see how they can be integrated into spatial visualisations.

Exploring the Data

Univariate Exploration

Let’s start with a few simple things to try to understand the data. Produce a table showing the number of crimes associated with each primary_type. Sort the data from most common to least common. Take a moment to look at the types.

crimes %>%
  group_by(primary_type) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 34 x 2
##    primary_type         count
##    <chr>                <int>
##  1 THEFT               137758
##  2 BATTERY             119608
##  3 CRIMINAL DAMAGE      74558
##  4 NARCOTICS            65958
##  5 ASSAULT              41439
##  6 OTHER OFFENSE        40423
##  7 BURGLARY             37106
##  8 MOTOR VEHICLE THEFT  29904
##  9 DECEPTIVE PRACTICE   26111
## 10 ROBBERY              24614
## # … with 24 more rows
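As an aside, dplyr’s `count()` collapses the `group_by()`/`summarize()`/`arrange()` pipeline above into a single step (this uses only standard dplyr, not the course helper):

```r
# count() groups, tallies, and (with sort = TRUE) orders in one call;
# note that the resulting count column is named n rather than count.
crimes %>%
  count(primary_type, sort = TRUE)
```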

Repeat with the description variable. Notice that there are far more categories here.

crimes %>%
  group_by(description) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 456 x 2
##    description                  count
##    <chr>                        <int>
##  1 SIMPLE                       71810
##  2 DOMESTIC BATTERY SIMPLE      55608
##  3 $500 AND UNDER               50341
##  4 TO VEHICLE                   35688
##  5 TO PROPERTY                  34379
##  6 OVER $500                    32855
##  7 POSS: CANNABIS 30GMS OR LESS 25673
##  8 FORCIBLE ENTRY               24887
##  9 FROM BUILDING                23942
## 10 AUTOMOBILE                   23637
## # … with 446 more rows

And again with location_desc:

crimes %>%
  group_by(location_desc) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 168 x 2
##    location_desc                   count
##    <chr>                           <int>
##  1 STREET                         166957
##  2 RESIDENCE                      107838
##  3 APARTMENT                       71880
##  4 SIDEWALK                        65797
##  5 OTHER                           23987
##  6 PARKING LOT/GARAGE(NON.RESID.)  18073
##  7 ALLEY                           14597
##  8 SCHOOL, PUBLIC, BUILDING        13089
##  9 RESIDENCE-GARAGE                12320
## 10 SMALL RETAIL STORE              12132
## # … with 158 more rows
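With this many categories, it can help to collapse rare levels before tabulating or plotting. A sketch using forcats (assuming the package is installed; `fct_lump_n()` keeps the n most frequent levels and lumps the rest into "Other"):

```r
library(dplyr)
library(forcats)

# Keep the ten most common location descriptions and lump the rest.
crimes %>%
  mutate(location_desc = fct_lump_n(location_desc, n = 10)) %>%
  count(location_desc, sort = TRUE)
```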

Spatial Analysis

Now, let’s put a few variables together. Create a plot of the community areas showing the number of crimes per person (perhaps per 1000 people). Note that you should not try to merge the spatial data directly into the crimes table; the result is too large and will crash R. Instead, aggregate the crimes first and then join the much smaller summary table to the spatial data.

crimes %>%
  group_by(comarea) %>%
  summarize(sm_count()) %>%
  inner_join(comarea, by = "comarea") %>%
  inner_join(socio, by = "comarea") %>%
  mutate(crimes_per_person = count / population * 1000) %>%
  st_as_sf() %>%
  st_transform(3435) %>%
  ggplot() +
    geom_sf(aes(fill = crimes_per_person)) +
    scale_fill_distiller(
      trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10
    ) +
    theme_void()

Repeat the question above with crimes per household. Note whether there are any large differences (in general, either normalization is acceptable, depending on your preference).

crimes %>%
  group_by(comarea) %>%
  summarize(sm_count()) %>%
  inner_join(comarea, by = "comarea") %>%
  inner_join(socio, by = "comarea") %>%
  mutate(crimes_per_hh = count / num_households * 1000) %>%
  st_as_sf() %>%
  st_transform(3435) %>%
  ggplot() +
    geom_sf(aes(fill = crimes_per_hh)) +
    scale_fill_distiller(
      trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10
    ) +
    theme_void()