18. Spatial Data

This set of notes introduces methods for working with spatial data in R. We will see methods for creating spatial data from a data set with longitude and latitude, how to plot spatial data, how to project spatial data, and how to compute spatial summaries.

Spatial Points

Let’s start by reading in the familiar US cities population data. I will take just a single year of data and remove rows without location information.

us <- read_csv("../data/us_city_population.csv") %>%
  filter(year == 2010) %>% 
  filter(!is.na(lon), !is.na(lat))  %>%
  select(city, population, lon, lat, state)

us

## # A tibble: 289 × 5
##    city            population    lon   lat state
##    <chr>                <dbl>  <dbl> <dbl> <chr>
##  1 Anchorage, AK         292. -149.   61.2 AK   
##  2 Birmingham, AL        212.  -86.8  33.5 AL   
##  3 Huntsville, AL        180.  -86.5  34.8 AL   
##  4 Mobile, AL            195.  -88.1  30.7 AL   
##  5 Montgomery, AL        206.  -86.3  32.3 AL   
##  6 Little Rock, AR       194.  -92.4  34.7 AR   
##  7 Chandler, AZ          236. -112.   33.3 AZ   
##  8 Gilbert, AZ           208. -112.   33.3 AZ   
##  9 Glendale, AZ          227. -112.   33.5 AZ   
## 10 Mesa, AZ              439. -112.   33.4 AZ   
## # … with 279 more rows

As we have seen, we can plot

us %>% 
  ggplot(aes(lon, lat)) +
    geom_point()

The map above is not bad, we can do better by turning this into a “proper” geospatial object. The code below creates a spatial points dataset by indicating that the columns “lon” and “lat” contain longitude and latitude records. The CRS code (4326) indicates that these are “raw” longitude and latitude and the remove flag indicates that we do not want to lose all the other columns in the data. You can typically use these same values of these parameters in other tasks.

us_geo <- us %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE)

Printing the us_geo data below shows how differs from other tabular datasets we have used this semester. There’s extra metadata attached to the head of the data and a special column called geometry that has the spatial information attached to it.

us_geo

## Simple feature collection with 289 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -157.8453 ymin: 21.32585 xmax: -70.93791 ymax: 61.17755
## Geodetic CRS:  WGS 84
## # A tibble: 289 × 6
##    city            population    lon   lat state             geometry
##  * <chr>                <dbl>  <dbl> <dbl> <chr>          <POINT [°]>
##  1 Anchorage, AK         292. -149.   61.2 AK    (-149.2744 61.17755)
##  2 Birmingham, AL        212.  -86.8  33.5 AL    (-86.79905 33.52744)
##  3 Huntsville, AL        180.  -86.5  34.8 AL      (-86.539 34.78427)
##  4 Mobile, AL            195.  -88.1  30.7 AL    (-88.10023 30.66843)
##  5 Montgomery, AL        206.  -86.3  32.3 AL    (-86.26859 32.34625)
##  6 Little Rock, AR       194.  -92.4  34.7 AR    (-92.35856 34.72543)
##  7 Chandler, AZ          236. -112.   33.3 AZ    (-111.8549 33.28287)
##  8 Gilbert, AZ           208. -112.   33.3 AZ    (-111.7422 33.31021)
##  9 Glendale, AZ          227. -112.   33.5 AZ    (-112.1899 33.53311)
## 10 Mesa, AZ              439. -112.   33.4 AZ    (-111.7174 33.40193)
## # … with 279 more rows

In order to plot this data, we will use a special geometry called geom_sf. It knows to use the geometry aesthetic to plot the data.

us_geo %>% 
  ggplot() +
    geom_sf()

The x and y aesthetics do not need to be set for geom_sf, but we can set other aesthetics just as with any set of points. For example, in the code below I have changed the points to be a different (fixed) color and size.

us_geo %>% 
  ggplot() +
    geom_sf(color = "olivedrab", size = 3)

One nice thing that we can do with spatial data is change the projection before plotting. This create a better representation of the curved earth on a flat plot based on the region of the world that we are looking at. To change the projection, we can pipe the data to the function st_transform(). The function takes one argument, a numeric code giving the CRS number of the projection.

Above, you have already saw that 4326 is the projection for longitude and latitude. Below, I will redo your plot of the data (with the default size and color) below, using the CRS projection 5069.

us_geo %>% 
  st_transform(5069) %>%
  ggplot() +
    geom_sf()

There are two other spatial geometries: geom_sf_text and geom_sf_label. They add textual annotations to the plot based on the location of each row. As with geom_sf, you do not need to set the x- and y-aesthetics. However, you do need to specify the label aesthetic. Below, I will a plot using the coordinate system with crs 5069 which uses geom_sf_text (rather than geom_sf) to label the cities. I’ll remove Alaska and Hawaii to make it easier to read.

us_geo %>% 
  filter(!(state %in% c("AK", "HI"))) %>%
  st_transform(5069) %>%
  ggplot() +
    geom_sf_text(aes(label = city)) +
    coord_sf()   # You need to add this to get the correct labels on the x and

                 # y-axis if you use geom_sf_text without geom_sf

Polygons

Spatial data can also associate regions (polygons) with each row rather than a single point. It is less likely that you will create this type of spatial yourself, instead reading the data from a file that is already designed to store spatial information. The file type we will use is called GeoJSON. It has a standard format so we do not need to parse it ourselves. The function read_sf can read geojson, storing the data in R as a spatial data frame. [It can also read many other geospatial formats, such as ESRI shp files.]

state <- read_sf("../data/state.geojson")
state

## Simple feature collection with 50 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS:  WGS 84
## # A tibble: 50 × 4
##    name           abb   fips                                        geometry
##    <chr>          <chr> <chr>                             <MULTIPOLYGON [°]>
##  1 Maine          ME    23    (((-70.70382 43.05983, -70.8268 43.12709, -70…
##  2 New Hampshire  NH    33    (((-71.50109 45.01338, -71.40564 45.19814, -7…
##  3 Delaware       DE    10    (((-75.7886 39.7222, -75.61724 39.83423, -75.…
##  4 South Carolina SC    45    (((-83.10861 35.00066, -82.46009 35.17814, -8…
##  5 Nebraska       NE    31    (((-104.0531 43.00059, -102.7921 42.99998, -1…
##  6 Washington     WA    53    (((-117.0324 48.99919, -117.0411 48.1249, -11…
##  7 New Mexico     NM    35    (((-109.0452 36.99908, -108.2494 36.99901, -1…
##  8 South Dakota   SD    46    (((-104.0577 44.99743, -104.0397 45.00055, -1…
##  9 Texas          TX    48    (((-103.0024 36.5004, -102.2505 36.50037, -10…
## 10 California     CA    06    (((-124.2116 41.99846, -123.3476 41.99911, -1…
## # … with 40 more rows

As with spatial points, we can create a reasonable plot by piping the state data in to ggplot and using geom_sf, as we see here:

state %>% 
  ggplot() +
    geom_sf()

It is much more clear with polygons how distorted the default projection makes everything. Here is the plot modified using the projection with CRS 5069.

state %>% 
  st_transform(5069) %>%
  ggplot() +
    geom_sf()

We can modify spatial polygons just as with any other dataset and use geom_sf_text as with the spatial points. Here, we remove the Alaska and Hawaii polygons and add the state abbreviations as labels.

state %>% 
  filter(!(abb %in% c("AK", "HI"))) %>%
  st_transform(5069) %>%
  ggplot() +
    geom_sf() +
    geom_sf_text(aes(label = abb), size = 2)

Representing the spatial regions as polygons is not just useful as a visual aid. We can also compute statistics that describes the regions through numerical summaries. For example, this code inside of a mutate function computes the area of the polygon in square kilometers:

area = as.numeric(st_area(geometry)) / 1e6

Which can use here to compute the smallest and largest states by area:

state %>%
  mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
  arrange(area)

## Simple feature collection with 50 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS:  WGS 84
## # A tibble: 50 × 5
##    name          abb   fips                                  geometry   area
##    <chr>         <chr> <chr>                       <MULTIPOLYGON [°]>  <dbl>
##  1 Rhode Island  RI    44    (((-71.19564 41.67509, -71.13289 41.660…  2761.
##  2 Delaware      DE    10    (((-75.7886 39.7222, -75.61724 39.83423…  5165.
##  3 Connecticut   CT    09    (((-73.48731 42.04964, -73.05329 42.039… 12941.
##  4 Hawaii        HI    15    (((-156.0572 19.74254, -155.9403 19.852… 15260.
##  5 New Jersey    NJ    34    (((-75.50974 39.68611, -75.41506 39.801… 20139.
##  6 Massachusetts MA    25    (((-73.26496 42.74594, -72.45852 42.726… 20600.
##  7 New Hampshire NH    33    (((-71.50109 45.01338, -71.40564 45.198… 23993.
##  8 Vermont       VT    50    (((-73.34312 45.01084, -72.84563 45.016… 24791.
##  9 Maryland      MD    24    (((-79.47666 39.72108, -77.53488 39.720… 26755.
## 10 West Virginia WV    54    (((-82.59367 38.42181, -82.52958 38.405… 62844.
## # … with 40 more rows

One common technique is to add information to a spatial plot of polygons by using the fill aesthetic. Below, for example, we’ll make the color of each state correspond with its area (usually, we’d use something a bit more interesting, as we will see in the next two notebooks).

state %>%
  filter(!(abb %in% c("AK", "HI"))) %>%
  mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
  st_transform(5069) %>%
  ggplot() +
    geom_sf(aes(fill = area))

The default color scale is awful in general, and particularly bad for maps. Since we need both the x- and y-aesthetics to represent space, color becomes particularly important with spatial applications. Here is a color scale that I find works very well with spatial data:

scale_fill_distiller( trans = “log2”, palette = “Spectral”, guide = “legend”, n.breaks = 10 )

Which we can see in action here:

state %>%
  filter(!(abb %in% c("AK", "HI"))) %>%
  mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
  st_transform(5069) %>%
  ggplot() +
    geom_sf(aes(fill = area)) +
    scale_fill_distiller(
      trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10
    )

We will see more applications of these techniques and more summary statistics, as well as a third type of spatial information (lines), in today’s notebook.

Homework

For homework, make your selection of which dataset you want to work with for the final project and be prepared to talk about the challenges (or not) of accessing the data and reading it into R.