The notes today introduce a few details about joining together spatial data sets. The notes are relatively short but dense, so make sure to read carefully.
To start, let’s load the us cities data, taking just the year 2010 and cities with non-missing longitude and latitude.
<- read_csv("../data/us_city_population.csv") %>%
us filter(year == 2010) %>%
filter(!is.na(lon), !is.na(lat))
us
## # A tibble: 289 × 6
## city state year population lon lat
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Anchorage, AK AK 2010 292. -149. 61.2
## 2 Birmingham, AL AL 2010 212. -86.8 33.5
## 3 Huntsville, AL AL 2010 180. -86.5 34.8
## 4 Mobile, AL AL 2010 195. -88.1 30.7
## 5 Montgomery, AL AL 2010 206. -86.3 32.3
## 6 Little Rock, AR AR 2010 194. -92.4 34.7
## 7 Chandler, AZ AZ 2010 236. -112. 33.3
## 8 Gilbert, AZ AZ 2010 208. -112. 33.3
## 9 Glendale, AZ AZ 2010 227. -112. 33.5
## 10 Mesa, AZ AZ 2010 439. -112. 33.4
## # … with 279 more rows
We will also read in the state polygon spatial data:
<- read_sf("../data/geo_states.geojson")
state state
## Simple feature collection with 50 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS: WGS 84
## # A tibble: 50 × 4
## name abb fips geometry
## <chr> <chr> <chr> <MULTIPOLYGON [°]>
## 1 Maine ME 23 (((-70.70382 43.05983, -70.8268 43.12709, -70…
## 2 New Hampshire NH 33 (((-71.50109 45.01338, -71.40564 45.19814, -7…
## 3 Delaware DE 10 (((-75.7886 39.7222, -75.61724 39.83423, -75.…
## 4 South Carolina SC 45 (((-83.10861 35.00066, -82.46009 35.17814, -8…
## 5 Nebraska NE 31 (((-104.0531 43.00059, -102.7921 42.99998, -1…
## 6 Washington WA 53 (((-117.0324 48.99919, -117.0411 48.1249, -11…
## 7 New Mexico NM 35 (((-109.0452 36.99908, -108.2494 36.99901, -1…
## 8 South Dakota SD 46 (((-104.0577 44.99743, -104.0397 45.00055, -1…
## 9 Texas TX 48 (((-103.0024 36.5004, -102.2505 36.50037, -10…
## 10 California CA 06 (((-124.2116 41.99846, -123.3476 41.99911, -1…
## # … with 40 more rows
Notice that both of these data sets have a state code and therefore
it should be possible to join them using an ordinary
left_join
(or other join function). This works, but there
is a catch:
%>%
us left_join(state, by = c("state" = "abb"))
## # A tibble: 289 × 9
## city state year population lon lat name fips
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Anchorage, AK AK 2010 292. -149. 61.2 Alaska 02
## 2 Birmingham, AL AL 2010 212. -86.8 33.5 Alabama 01
## 3 Huntsville, AL AL 2010 180. -86.5 34.8 Alabama 01
## 4 Mobile, AL AL 2010 195. -88.1 30.7 Alabama 01
## 5 Montgomery, AL AL 2010 206. -86.3 32.3 Alabama 01
## 6 Little Rock, AR AR 2010 194. -92.4 34.7 Arkansas 05
## 7 Chandler, AZ AZ 2010 236. -112. 33.3 Arizona 04
## 8 Gilbert, AZ AZ 2010 208. -112. 33.3 Arizona 04
## 9 Glendale, AZ AZ 2010 227. -112. 33.5 Arizona 04
## 10 Mesa, AZ AZ 2010 439. -112. 33.4 Arizona 04
## # … with 279 more rows, and 1 more variable: geometry <MULTIPOLYGON [°]>
Notice that while the output does combine the data, and does contain
a geometry column, the data set has lost the metadata indicating that it
is in fact a spatial data table. You would get an error, for example,
for trying to use geom_sf
on the output. There is, however,
a simple fix; just add the st_as_sf
function as an
additional line:
%>%
us left_join(state, by = c("state" = "abb")) %>%
st_as_sf()
## Simple feature collection with 289 features and 8 fields (with 1 geometry empty)
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -168.1289 ymin: 18.91747 xmax: -69.92826 ymax: 71.29634
## Geodetic CRS: WGS 84
## # A tibble: 289 × 9
## city state year population lon lat name fips
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Anchorage, AK AK 2010 292. -149. 61.2 Alaska 02
## 2 Birmingham, AL AL 2010 212. -86.8 33.5 Alabama 01
## 3 Huntsville, AL AL 2010 180. -86.5 34.8 Alabama 01
## 4 Mobile, AL AL 2010 195. -88.1 30.7 Alabama 01
## 5 Montgomery, AL AL 2010 206. -86.3 32.3 Alabama 01
## 6 Little Rock, AR AR 2010 194. -92.4 34.7 Arkansas 05
## 7 Chandler, AZ AZ 2010 236. -112. 33.3 Arizona 04
## 8 Gilbert, AZ AZ 2010 208. -112. 33.3 Arizona 04
## 9 Glendale, AZ AZ 2010 227. -112. 33.5 Arizona 04
## 10 Mesa, AZ AZ 2010 439. -112. 33.4 Arizona 04
## # … with 279 more rows, and 1 more variable: geometry <MULTIPOLYGON [°]>
And now the spatial polygons have been preserved. Note that this problem only occurs when the geometry exists in the right table, though in my experience this is the most common situation.
Also, note that you need to be careful when combining two spatial
data tables with a regular key-based join. It is best to remove one of
the geometries prior to the join to avoid issues using the function
as_tibble()
A more interesting way to join spatial data is through a spatial join. This uses the spatial information itself to combine two datasets. There are several different spatial joins; here we will see how to join a points data set to a polygons data set by identifying which polygon each point appears inside.
As an example, let’s re-load the US cities data set but remove the state information. We will try to recreate this column using just the longitude and latitude.
<- read_csv("../data/us_city_population.csv") %>%
us filter(year == 2010) %>%
filter(!is.na(lon), !is.na(lat)) %>%
select(city, population, lon, lat, state)
us
## # A tibble: 289 × 5
## city population lon lat state
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Anchorage, AK 292. -149. 61.2 AK
## 2 Birmingham, AL 212. -86.8 33.5 AL
## 3 Huntsville, AL 180. -86.5 34.8 AL
## 4 Mobile, AL 195. -88.1 30.7 AL
## 5 Montgomery, AL 206. -86.3 32.3 AL
## 6 Little Rock, AR 194. -92.4 34.7 AR
## 7 Chandler, AZ 236. -112. 33.3 AZ
## 8 Gilbert, AZ 208. -112. 33.3 AZ
## 9 Glendale, AZ 227. -112. 33.5 AZ
## 10 Mesa, AZ 439. -112. 33.4 AZ
## # … with 279 more rows
To identify, using the spatial information, which polygon each point
exists inside we need to first convert the us
data set into
a spatial points data table using st_as_sf
. We can then
combine the two tables using the function spatial_join
(this is a smaller wrapper function I wrote that called
st_join
with some useful post-processing options). No key
needs to be specified because the function knows we want to join based
on the spatial data.
%>%
us st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
spatial_join(state)
## Simple feature collection with 289 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -157.8453 ymin: 21.32585 xmax: -70.93791 ymax: 61.17755
## Geodetic CRS: WGS 84
## # A tibble: 289 × 9
## city popul…¹ lon lat state geometry name abb
## <chr> <dbl> <dbl> <dbl> <chr> <POINT [°]> <chr> <chr>
## 1 Anchora… 292. -149. 61.2 AK (-149.2744 61.17755) Alaska AK
## 2 Birming… 212. -86.8 33.5 AL (-86.79905 33.52744) Alabama AL
## 3 Huntsvi… 180. -86.5 34.8 AL (-86.539 34.78427) Alabama AL
## 4 Mobile,… 195. -88.1 30.7 AL (-88.10023 30.66843) Alabama AL
## 5 Montgom… 206. -86.3 32.3 AL (-86.26859 32.34625) Alabama AL
## 6 Little … 194. -92.4 34.7 AR (-92.35856 34.72543) Arkansas AR
## 7 Chandle… 236. -112. 33.3 AZ (-111.8549 33.28287) Arizona AZ
## 8 Gilbert… 208. -112. 33.3 AZ (-111.7422 33.31021) Arizona AZ
## 9 Glendal… 227. -112. 33.5 AZ (-112.1899 33.53311) Arizona AZ
## 10 Mesa, AZ 439. -112. 33.4 AZ (-111.7174 33.40193) Arizona AZ
## # … with 279 more rows, 1 more variable: fips <chr>, and abbreviated
## # variable name ¹​population
The resulting data set has metadata joined from the state table with all of the city variables (including the geometry) intact. Copying the same join code, we can then use the geometry to plot the spatial data:
%>%
us st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
spatial_join(state) %>%
filter(abb %in% c("VA", "NC", "MD")) %>%
ggplot() +
geom_sf(aes(color = name))
Note that spatial_join
always retains the geometry of
the first data set used in the join. So, if we flip the two data sets we
will have a spatial dataset containing polygons:
%>%
state spatial_join(st_as_sf(us, coords = c("lon", "lat"), crs = 4326, remove = FALSE))
## Simple feature collection with 291 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS: WGS 84
## # A tibble: 291 × 9
## name abb fips geometry city popul…¹ lon lat
## <chr> <chr> <chr> <MULTIPOLYGON [°]> <chr> <dbl> <dbl> <dbl>
## 1 Maine ME 23 (((-70.70382 43.05983, -… <NA> NA NA NA
## 2 New Ham… NH 33 (((-71.50109 45.01338, -… Manc… 110. -71.4 43.0
## 3 New Ham… NH 33 (((-71.50109 45.01338, -… Nash… 86.5 -71.5 42.7
## 4 Delaware DE 10 (((-75.7886 39.7222, -75… <NA> NA NA NA
## 5 South C… SC 45 (((-83.10861 35.00066, -… Char… 120. -80.0 32.8
## 6 South C… SC 45 (((-83.10861 35.00066, -… Colu… 129. -80.9 34.0
## 7 Nebraska NE 31 (((-104.0531 43.00059, -… Linc… 258. -96.7 40.8
## 8 Nebraska NE 31 (((-104.0531 43.00059, -… Omah… 409. -96.0 41.3
## 9 Washing… WA 53 (((-117.0324 48.99919, -… Bell… 122. -122. 47.6
## 10 Washing… WA 53 (((-117.0324 48.99919, -… Ever… 103. -122. 48.0
## # … with 281 more rows, 1 more variable: state <chr>, and abbreviated
## # variable name ¹​population
Most of the time, at least in this class, will find the points are the correct thing to retain. Be careful joining with polygons on the left because you can quickly create very large data sets that may exceed your machine’s memory!
Another completely different way that we can visualize spatial points is by plotting the longitude and latitude points on top of a fixed map. There are a few R packages that do this. Finding them all a bit buggy, I wrote my own very minimal version. We did not install this at the start of the semester; if you would like to try it you, can download it using the first line in the code below. The second line load the package for use.
#remotes::install_github("statsmaths/ggmaptile", upgrade = "never")
library(ggmaptile)
The ggmaptile package provides a new geometry layer
called stat_maptiles
. If you add it to a plot with a
longitude and latitude, it will automatically grab map images that
displays an image of a map corresponding to the points.
%>%
us filter(state == "VA") %>%
ggplot(aes(lon, lat)) +
stat_maptiles(quiet = TRUE, cache_dir = "cache") +
geom_point(color = "#fe8019", size = 2) +
theme_void() # nice theme for a map
This technique is particularly useful when you are working with data with a small spatial range, such as data points inside of a city.
In the code above we saw one way of combining a dataset of spatial points with a dataset of spatial polygons. Namely, we associated a point with the polygon that contains it (or a polygon with the points it contains). There are many other types of spatial joins that exist. Try to think up an example of one way we might do a spatial combination of:
Then, try to come up with an application of each of these situations.