19. Spatial Joins

The notes today introduce a few details about joining together spatial data sets. The notes are relatively short but dense, so make sure to read carefully.

Joining with a key

To start, let’s load the us cities data, taking just the year 2010 and cities with non-missing longitude and latitude.

us <- read_csv("../data/us_city_population.csv") %>%
  filter(year == 2010) %>%
  filter(!is.na(lon), !is.na(lat))

us

## # A tibble: 289 × 6
##    city            state  year population    lon   lat
##    <chr>           <chr> <dbl>      <dbl>  <dbl> <dbl>
##  1 Anchorage, AK   AK     2010       292. -149.   61.2
##  2 Birmingham, AL  AL     2010       212.  -86.8  33.5
##  3 Huntsville, AL  AL     2010       180.  -86.5  34.8
##  4 Mobile, AL      AL     2010       195.  -88.1  30.7
##  5 Montgomery, AL  AL     2010       206.  -86.3  32.3
##  6 Little Rock, AR AR     2010       194.  -92.4  34.7
##  7 Chandler, AZ    AZ     2010       236. -112.   33.3
##  8 Gilbert, AZ     AZ     2010       208. -112.   33.3
##  9 Glendale, AZ    AZ     2010       227. -112.   33.5
## 10 Mesa, AZ        AZ     2010       439. -112.   33.4
## # … with 279 more rows

We will also read in the state polygon spatial data:

state <- read_sf("../data/geo_states.geojson")
state

## Simple feature collection with 50 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS:  WGS 84
## # A tibble: 50 × 4
##    name           abb   fips                                        geometry
##    <chr>          <chr> <chr>                             <MULTIPOLYGON [°]>
##  1 Maine          ME    23    (((-70.70382 43.05983, -70.8268 43.12709, -70…
##  2 New Hampshire  NH    33    (((-71.50109 45.01338, -71.40564 45.19814, -7…
##  3 Delaware       DE    10    (((-75.7886 39.7222, -75.61724 39.83423, -75.…
##  4 South Carolina SC    45    (((-83.10861 35.00066, -82.46009 35.17814, -8…
##  5 Nebraska       NE    31    (((-104.0531 43.00059, -102.7921 42.99998, -1…
##  6 Washington     WA    53    (((-117.0324 48.99919, -117.0411 48.1249, -11…
##  7 New Mexico     NM    35    (((-109.0452 36.99908, -108.2494 36.99901, -1…
##  8 South Dakota   SD    46    (((-104.0577 44.99743, -104.0397 45.00055, -1…
##  9 Texas          TX    48    (((-103.0024 36.5004, -102.2505 36.50037, -10…
## 10 California     CA    06    (((-124.2116 41.99846, -123.3476 41.99911, -1…
## # … with 40 more rows

Notice that both of these data sets have a state code and therefore it should be possible to join them using an ordinary left_join (or other join function). This works, but there is a catch:

us %>%
  left_join(state, by = c("state" = "abb"))

## # A tibble: 289 × 9
##    city            state  year population    lon   lat name     fips 
##    <chr>           <chr> <dbl>      <dbl>  <dbl> <dbl> <chr>    <chr>
##  1 Anchorage, AK   AK     2010       292. -149.   61.2 Alaska   02   
##  2 Birmingham, AL  AL     2010       212.  -86.8  33.5 Alabama  01   
##  3 Huntsville, AL  AL     2010       180.  -86.5  34.8 Alabama  01   
##  4 Mobile, AL      AL     2010       195.  -88.1  30.7 Alabama  01   
##  5 Montgomery, AL  AL     2010       206.  -86.3  32.3 Alabama  01   
##  6 Little Rock, AR AR     2010       194.  -92.4  34.7 Arkansas 05   
##  7 Chandler, AZ    AZ     2010       236. -112.   33.3 Arizona  04   
##  8 Gilbert, AZ     AZ     2010       208. -112.   33.3 Arizona  04   
##  9 Glendale, AZ    AZ     2010       227. -112.   33.5 Arizona  04   
## 10 Mesa, AZ        AZ     2010       439. -112.   33.4 Arizona  04   
## # … with 279 more rows, and 1 more variable: geometry <MULTIPOLYGON [°]>

Notice that while the output does combine the data, and does contain a geometry column, the data set has lost the metadata indicating that it is in fact a spatial data table. You would get an error, for example, for trying to use geom_sf on the output. There is, however, a simple fix; just add the st_as_sf function as an additional line:

us %>%
  left_join(state, by = c("state" = "abb")) %>%
  st_as_sf()

## Simple feature collection with 289 features and 8 fields (with 1 geometry empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -168.1289 ymin: 18.91747 xmax: -69.92826 ymax: 71.29634
## Geodetic CRS:  WGS 84
## # A tibble: 289 × 9
##    city            state  year population    lon   lat name     fips 
##    <chr>           <chr> <dbl>      <dbl>  <dbl> <dbl> <chr>    <chr>
##  1 Anchorage, AK   AK     2010       292. -149.   61.2 Alaska   02   
##  2 Birmingham, AL  AL     2010       212.  -86.8  33.5 Alabama  01   
##  3 Huntsville, AL  AL     2010       180.  -86.5  34.8 Alabama  01   
##  4 Mobile, AL      AL     2010       195.  -88.1  30.7 Alabama  01   
##  5 Montgomery, AL  AL     2010       206.  -86.3  32.3 Alabama  01   
##  6 Little Rock, AR AR     2010       194.  -92.4  34.7 Arkansas 05   
##  7 Chandler, AZ    AZ     2010       236. -112.   33.3 Arizona  04   
##  8 Gilbert, AZ     AZ     2010       208. -112.   33.3 Arizona  04   
##  9 Glendale, AZ    AZ     2010       227. -112.   33.5 Arizona  04   
## 10 Mesa, AZ        AZ     2010       439. -112.   33.4 Arizona  04   
## # … with 279 more rows, and 1 more variable: geometry <MULTIPOLYGON [°]>

And now the spatial polygons have been preserved. Note that this problem only occurs when the geometry exists in the right table, though in my experience this is the most common situation.

Also, note that you need to be careful when combining two spatial data tables with a regular key-based join. It is best to remove one of the geometries prior to the join to avoid issues using the function as_tibble()

Spatial Joins

A more interesting way to join spatial data is through a spatial join. This uses the spatial information itself to combine two datasets. There are several different spatial joins; here we will see how to join a points data set to a polygons data set by identifying which polygon each point appears inside.

As an example, let’s re-load the US cities data set but remove the state information. We will try to recreate this column using just the longitude and latitude.

us <- read_csv("../data/us_city_population.csv") %>%
  filter(year == 2010) %>%
  filter(!is.na(lon), !is.na(lat)) %>%
  select(city, population, lon, lat, state)

us

## # A tibble: 289 × 5
##    city            population    lon   lat state
##    <chr>                <dbl>  <dbl> <dbl> <chr>
##  1 Anchorage, AK         292. -149.   61.2 AK   
##  2 Birmingham, AL        212.  -86.8  33.5 AL   
##  3 Huntsville, AL        180.  -86.5  34.8 AL   
##  4 Mobile, AL            195.  -88.1  30.7 AL   
##  5 Montgomery, AL        206.  -86.3  32.3 AL   
##  6 Little Rock, AR       194.  -92.4  34.7 AR   
##  7 Chandler, AZ          236. -112.   33.3 AZ   
##  8 Gilbert, AZ           208. -112.   33.3 AZ   
##  9 Glendale, AZ          227. -112.   33.5 AZ   
## 10 Mesa, AZ              439. -112.   33.4 AZ   
## # … with 279 more rows

To identify, using the spatial information, which polygon each point exists inside we need to first convert the us data set into a spatial points data table using st_as_sf. We can then combine the two tables using the function spatial_join (this is a smaller wrapper function I wrote that called st_join with some useful post-processing options). No key needs to be specified because the function knows we want to join based on the spatial data.

us %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  spatial_join(state)

## Simple feature collection with 289 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -157.8453 ymin: 21.32585 xmax: -70.93791 ymax: 61.17755
## Geodetic CRS:  WGS 84
## # A tibble: 289 × 9
##    city     popul…¹    lon   lat state             geometry name     abb  
##    <chr>      <dbl>  <dbl> <dbl> <chr>          <POINT [°]> <chr>    <chr>
##  1 Anchora…    292. -149.   61.2 AK    (-149.2744 61.17755) Alaska   AK   
##  2 Birming…    212.  -86.8  33.5 AL    (-86.79905 33.52744) Alabama  AL   
##  3 Huntsvi…    180.  -86.5  34.8 AL      (-86.539 34.78427) Alabama  AL   
##  4 Mobile,…    195.  -88.1  30.7 AL    (-88.10023 30.66843) Alabama  AL   
##  5 Montgom…    206.  -86.3  32.3 AL    (-86.26859 32.34625) Alabama  AL   
##  6 Little …    194.  -92.4  34.7 AR    (-92.35856 34.72543) Arkansas AR   
##  7 Chandle…    236. -112.   33.3 AZ    (-111.8549 33.28287) Arizona  AZ   
##  8 Gilbert…    208. -112.   33.3 AZ    (-111.7422 33.31021) Arizona  AZ   
##  9 Glendal…    227. -112.   33.5 AZ    (-112.1899 33.53311) Arizona  AZ   
## 10 Mesa, AZ    439. -112.   33.4 AZ    (-111.7174 33.40193) Arizona  AZ   
## # … with 279 more rows, 1 more variable: fips <chr>, and abbreviated
## #   variable name ¹population

The resulting data set has metadata joined from the state table with all of the city variables (including the geometry) intact. Copying the same join code, we can then use the geometry to plot the spatial data:

us %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  spatial_join(state) %>%
  filter(abb %in% c("VA", "NC", "MD")) %>%
  ggplot() +
    geom_sf(aes(color = name))

Note that spatial_join always retains the geometry of the first data set used in the join. So, if we flip the two data sets we will have a spatial dataset containing polygons:

state %>%
  spatial_join(st_as_sf(us, coords = c("lon", "lat"), crs = 4326, remove = FALSE))

## Simple feature collection with 291 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS:  WGS 84
## # A tibble: 291 × 9
##    name     abb   fips                   geometry city  popul…¹    lon   lat
##    <chr>    <chr> <chr>        <MULTIPOLYGON [°]> <chr>   <dbl>  <dbl> <dbl>
##  1 Maine    ME    23    (((-70.70382 43.05983, -… <NA>     NA     NA    NA  
##  2 New Ham… NH    33    (((-71.50109 45.01338, -… Manc…   110.   -71.4  43.0
##  3 New Ham… NH    33    (((-71.50109 45.01338, -… Nash…    86.5  -71.5  42.7
##  4 Delaware DE    10    (((-75.7886 39.7222, -75… <NA>     NA     NA    NA  
##  5 South C… SC    45    (((-83.10861 35.00066, -… Char…   120.   -80.0  32.8
##  6 South C… SC    45    (((-83.10861 35.00066, -… Colu…   129.   -80.9  34.0
##  7 Nebraska NE    31    (((-104.0531 43.00059, -… Linc…   258.   -96.7  40.8
##  8 Nebraska NE    31    (((-104.0531 43.00059, -… Omah…   409.   -96.0  41.3
##  9 Washing… WA    53    (((-117.0324 48.99919, -… Bell…   122.  -122.   47.6
## 10 Washing… WA    53    (((-117.0324 48.99919, -… Ever…   103.  -122.   48.0
## # … with 281 more rows, 1 more variable: state <chr>, and abbreviated
## #   variable name ¹population

Most of the time, at least in this class, will find the points are the correct thing to retain. Be careful joining with polygons on the left because you can quickly create very large data sets that may exceed your machine’s memory!

Raster Maps

Another completely different way that we can visualize spatial points is by plotting the longitude and latitude points on top of a fixed map. There are a few R packages that do this. Finding them all a bit buggy, I wrote my own very minimal version. We did not install this at the start of the semester; if you would like to try it you, can download it using the first line in the code below. The second line load the package for use.

#remotes::install_github("statsmaths/ggmaptile", upgrade = "never")
library(ggmaptile)

The ggmaptile package provides a new geometry layer called stat_maptiles. If you add it to a plot with a longitude and latitude, it will automatically grab map images that displays an image of a map corresponding to the points.

us %>% 
  filter(state == "VA") %>%
  ggplot(aes(lon, lat)) +
    stat_maptiles(quiet = TRUE, cache_dir = "cache") +
    geom_point(color = "#fe8019", size = 2) +
    theme_void()       # nice theme for a map

This technique is particularly useful when you are working with data with a small spatial range, such as data points inside of a city.

Homework

In the code above we saw one way of combining a dataset of spatial points with a dataset of spatial polygons. Namely, we associated a point with the polygon that contains it (or a polygon with the points it contains). There are many other types of spatial joins that exist. Try to think up an example of one way we might do a spatial combination of:

A spatial points dataset with another spatial points dataset.
A spatial lines dataset with a spatial lines dataset.
A spatial polygons dataset with another spatial points dataset.
Another way of combining points and polygons than was given above.

Then, try to come up with an application of each of these situations.