Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

We have one final extra package to install, which will allow us to show map data in R. Uncomment the following line, run the code, and re-comment the line before proceeding.

#remotes::install_github("statsmaths/ggmaptile", upgrade = "never")

Chicago Data

Load the Data

For the last part of the semester we will be looking at a number of open data sets from the City of Chicago. We will load in several here:

comarea <- read_sf(file.path("data", "chicago_community_areas.geojson"))
ziparea <- read_sf(file.path("data", "zip_codes.geojson"))
socio <- read_csv(file.path("data", "census_socioeconomic.csv"))
medical <- read_csv(file.path("data", "chicago_medical_examiner_cases.csv.gz"))

Today, our focus on the spatial components of the dataset. The spatial data contains information about zip codes and community areas. Specifically, we will look at a set of records by the Medican Examiner’s office showing deaths they have investigated:

medical
## # A tibble: 34,566 x 14
##    date_incident_iso   date_death_iso      primary_cause   age gender race 
##    <dttm>              <dttm>              <chr>         <dbl> <chr>  <chr>
##  1 2015-01-01 03:25:00 2015-01-01 03:10:00 HYPERTENSIVE…    50 Male   Black
##  2 2014-12-08 18:37:00 2015-01-01 07:15:00 COMPLICATION…    89 Female White
##  3 2014-12-21 23:13:00 2015-01-01 08:45:00 COMPLICATION…    67 Male   Black
##  4 2015-01-01 09:07:00 2015-01-01 09:20:00 COMPLICATION…    31 Male   White
##  5 2015-01-01 09:54:00 2015-01-01 10:10:00 ASPHYXIATION     25 Male   White
##  6 2015-01-01 10:45:00 2015-01-01 11:19:00 HYPERTENSIVE…    71 Male   White
##  7 2015-01-01 10:43:00 2015-01-01 11:22:00 DIABETIC KET…    61 Female Black
##  8 2015-01-01 12:00:00 2015-01-01 13:18:00 ALCOHOL TOXI…    26 Male   White
##  9 2015-01-01 15:14:00 2015-01-01 15:30:00 ARTERIOSCLER…    56 Male   White
## 10 2015-01-01 15:27:00 2015-01-01 15:45:00 CHRONIC OBST…    81 Female White
## # … with 34,556 more rows, and 8 more variables: latino <lgl>,
## #   gun_related <lgl>, opioid_related <lgl>, cold_related <lgl>,
## #   heat_related <lgl>, lon <dbl>, lat <dbl>, residence_zip <chr>

Most of these variables should be relatively self-explanatory.

Advanced Spatial Data

Fixing Left Key Joins

When working on Project 3, we identified an issue with the way that R works when joining a spatial dataset and a non-spatial dataset. Unless the spatial data comes on the left (e.g., first), the spatial data attributes are lost. This is a problem if you later want to use spatial methods on the data. For example,

medical %>%
  left_join(ziparea, by = c("residence_zip" = "zip"))
## # A tibble: 34,566 x 15
##    date_incident_iso   date_death_iso      primary_cause   age gender race 
##    <dttm>              <dttm>              <chr>         <dbl> <chr>  <chr>
##  1 2015-01-01 03:25:00 2015-01-01 03:10:00 HYPERTENSIVE…    50 Male   Black
##  2 2014-12-08 18:37:00 2015-01-01 07:15:00 COMPLICATION…    89 Female White
##  3 2014-12-21 23:13:00 2015-01-01 08:45:00 COMPLICATION…    67 Male   Black
##  4 2015-01-01 09:07:00 2015-01-01 09:20:00 COMPLICATION…    31 Male   White
##  5 2015-01-01 09:54:00 2015-01-01 10:10:00 ASPHYXIATION     25 Male   White
##  6 2015-01-01 10:45:00 2015-01-01 11:19:00 HYPERTENSIVE…    71 Male   White
##  7 2015-01-01 10:43:00 2015-01-01 11:22:00 DIABETIC KET…    61 Female Black
##  8 2015-01-01 12:00:00 2015-01-01 13:18:00 ALCOHOL TOXI…    26 Male   White
##  9 2015-01-01 15:14:00 2015-01-01 15:30:00 ARTERIOSCLER…    56 Male   White
## 10 2015-01-01 15:27:00 2015-01-01 15:45:00 CHRONIC OBST…    81 Female White
## # … with 34,556 more rows, and 9 more variables: latino <lgl>,
## #   gun_related <lgl>, opioid_related <lgl>, cold_related <lgl>,
## #   heat_related <lgl>, lon <dbl>, lat <dbl>, residence_zip <chr>,
## #   geometry <MULTIPOLYGON [°]>

I have now determined an easy solution. Simply use the function st_as_sf after performing the join. This will recreate the spatial metadata with the correct projection and spatial bounding coordinates:

medical %>%
  left_join(ziparea, by = c("residence_zip" = "zip")) %>%
  st_as_sf()
## Simple feature collection with 34566 features and 14 fields (with 122 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -149.964 ymin: 18.13382 xmax: -65.98956 ymax: 61.14823
## geographic CRS: NAD83
## # A tibble: 34,566 x 15
##    date_incident_iso   date_death_iso      primary_cause   age gender race 
##    <dttm>              <dttm>              <chr>         <dbl> <chr>  <chr>
##  1 2015-01-01 03:25:00 2015-01-01 03:10:00 HYPERTENSIVE…    50 Male   Black
##  2 2014-12-08 18:37:00 2015-01-01 07:15:00 COMPLICATION…    89 Female White
##  3 2014-12-21 23:13:00 2015-01-01 08:45:00 COMPLICATION…    67 Male   Black
##  4 2015-01-01 09:07:00 2015-01-01 09:20:00 COMPLICATION…    31 Male   White
##  5 2015-01-01 09:54:00 2015-01-01 10:10:00 ASPHYXIATION     25 Male   White
##  6 2015-01-01 10:45:00 2015-01-01 11:19:00 HYPERTENSIVE…    71 Male   White
##  7 2015-01-01 10:43:00 2015-01-01 11:22:00 DIABETIC KET…    61 Female Black
##  8 2015-01-01 12:00:00 2015-01-01 13:18:00 ALCOHOL TOXI…    26 Male   White
##  9 2015-01-01 15:14:00 2015-01-01 15:30:00 ARTERIOSCLER…    56 Male   White
## 10 2015-01-01 15:27:00 2015-01-01 15:45:00 CHRONIC OBST…    81 Female White
## # … with 34,556 more rows, and 9 more variables: latino <lgl>,
## #   gun_related <lgl>, opioid_related <lgl>, cold_related <lgl>,
## #   heat_related <lgl>, lon <dbl>, lat <dbl>, residence_zip <chr>,
## #   geometry <MULTIPOLYGON [°]>

This should help avoid the need to create many temporary datasets or to do very large joins of the data.

Creating Spatial Data

The medical records have a longitude and latitude attached to them based on the reported place of death. We can create a spatial object from this using the function st_as_sf. We will make sure that R knows these are longitude and latitude coordinates (crs = 4326) and that the lon/lat coordinates are not removed as columns in the data (remove = FALSE). After doing so, we can plot the data using geom_sf, which benefits from nice axis labels and the ability to do coordinate projections.

medical %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  ggplot() +
    geom_sf(alpha = 0.01)