This set of notes introduces methods for working with spatial data in R. We will see methods for creating spatial data from a data set with longitude and latitude, how to plot spatial data, how to project spatial data, and how to compute spatial summaries.
Let’s start by reading in the familiar US cities population data. I will take just a single year of data and remove rows without location information.
<- read_csv("../data/us_city_population.csv") %>%
us filter(year == 2010) %>%
filter(!is.na(lon), !is.na(lat)) %>%
select(city, population, lon, lat, state)
us
## # A tibble: 289 × 5
## city population lon lat state
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Anchorage, AK 292. -149. 61.2 AK
## 2 Birmingham, AL 212. -86.8 33.5 AL
## 3 Huntsville, AL 180. -86.5 34.8 AL
## 4 Mobile, AL 195. -88.1 30.7 AL
## 5 Montgomery, AL 206. -86.3 32.3 AL
## 6 Little Rock, AR 194. -92.4 34.7 AR
## 7 Chandler, AZ 236. -112. 33.3 AZ
## 8 Gilbert, AZ 208. -112. 33.3 AZ
## 9 Glendale, AZ 227. -112. 33.5 AZ
## 10 Mesa, AZ 439. -112. 33.4 AZ
## # … with 279 more rows
As we have seen, we can plot
%>%
us ggplot(aes(lon, lat)) +
geom_point()
The map above is not bad, we can do better by turning this into a “proper” geospatial object. The code below creates a spatial points dataset by indicating that the columns “lon” and “lat” contain longitude and latitude records. The CRS code (4326) indicates that these are “raw” longitude and latitude and the remove flag indicates that we do not want to lose all the other columns in the data. You can typically use these same values of these parameters in other tasks.
<- us %>%
us_geo st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE)
Printing the us_geo
data below shows how differs from
other tabular datasets we have used this semester. There’s extra
metadata attached to the head of the data and a special column called
geometry
that has the spatial information attached to
it.
us_geo
## Simple feature collection with 289 features and 5 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -157.8453 ymin: 21.32585 xmax: -70.93791 ymax: 61.17755
## Geodetic CRS: WGS 84
## # A tibble: 289 × 6
## city population lon lat state geometry
## * <chr> <dbl> <dbl> <dbl> <chr> <POINT [°]>
## 1 Anchorage, AK 292. -149. 61.2 AK (-149.2744 61.17755)
## 2 Birmingham, AL 212. -86.8 33.5 AL (-86.79905 33.52744)
## 3 Huntsville, AL 180. -86.5 34.8 AL (-86.539 34.78427)
## 4 Mobile, AL 195. -88.1 30.7 AL (-88.10023 30.66843)
## 5 Montgomery, AL 206. -86.3 32.3 AL (-86.26859 32.34625)
## 6 Little Rock, AR 194. -92.4 34.7 AR (-92.35856 34.72543)
## 7 Chandler, AZ 236. -112. 33.3 AZ (-111.8549 33.28287)
## 8 Gilbert, AZ 208. -112. 33.3 AZ (-111.7422 33.31021)
## 9 Glendale, AZ 227. -112. 33.5 AZ (-112.1899 33.53311)
## 10 Mesa, AZ 439. -112. 33.4 AZ (-111.7174 33.40193)
## # … with 279 more rows
In order to plot this data, we will use a special geometry called
geom_sf
. It knows to use the geometry
aesthetic to plot the data.
%>%
us_geo ggplot() +
geom_sf()
The x and y aesthetics do not need to be set for
geom_sf
, but we can set other aesthetics just as with any
set of points. For example, in the code below I have changed the points
to be a different (fixed) color and size.
%>%
us_geo ggplot() +
geom_sf(color = "olivedrab", size = 3)
One nice thing that we can do with spatial data is change the
projection before plotting. This create a better representation of the
curved earth on a flat plot based on the region of the world that we are
looking at. To change the projection, we can pipe the data to the
function st_transform()
. The function takes one argument, a
numeric code giving the CRS number of the projection.
Above, you have already saw that 4326 is the projection for longitude and latitude. Below, I will redo your plot of the data (with the default size and color) below, using the CRS projection 5069.
%>%
us_geo st_transform(5069) %>%
ggplot() +
geom_sf()
There are two other spatial geometries: geom_sf_text
and
geom_sf_label
. They add textual annotations to the plot
based on the location of each row. As with geom_sf
, you do
not need to set the x- and y-aesthetics. However, you do need to specify
the label aesthetic. Below, I will a plot using the coordinate system
with crs 5069 which uses geom_sf_text
(rather than
geom_sf
) to label the cities. I’ll remove Alaska and Hawaii
to make it easier to read.
%>%
us_geo filter(!(state %in% c("AK", "HI"))) %>%
st_transform(5069) %>%
ggplot() +
geom_sf_text(aes(label = city)) +
coord_sf() # You need to add this to get the correct labels on the x and
# y-axis if you use geom_sf_text without geom_sf
Spatial data can also associate regions (polygons) with each row
rather than a single point. It is less likely that you will create this
type of spatial yourself, instead reading the data from a file that is
already designed to store spatial information. The file type we will use
is called GeoJSON. It has a standard format so we do not need to parse
it ourselves. The function read_sf
can read geojson,
storing the data in R as a spatial data frame. [It can also read many
other geospatial formats, such as ESRI shp files.]
<- read_sf("../data/state.geojson")
state state
## Simple feature collection with 50 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS: WGS 84
## # A tibble: 50 × 4
## name abb fips geometry
## <chr> <chr> <chr> <MULTIPOLYGON [°]>
## 1 Maine ME 23 (((-70.70382 43.05983, -70.8268 43.12709, -70…
## 2 New Hampshire NH 33 (((-71.50109 45.01338, -71.40564 45.19814, -7…
## 3 Delaware DE 10 (((-75.7886 39.7222, -75.61724 39.83423, -75.…
## 4 South Carolina SC 45 (((-83.10861 35.00066, -82.46009 35.17814, -8…
## 5 Nebraska NE 31 (((-104.0531 43.00059, -102.7921 42.99998, -1…
## 6 Washington WA 53 (((-117.0324 48.99919, -117.0411 48.1249, -11…
## 7 New Mexico NM 35 (((-109.0452 36.99908, -108.2494 36.99901, -1…
## 8 South Dakota SD 46 (((-104.0577 44.99743, -104.0397 45.00055, -1…
## 9 Texas TX 48 (((-103.0024 36.5004, -102.2505 36.50037, -10…
## 10 California CA 06 (((-124.2116 41.99846, -123.3476 41.99911, -1…
## # … with 40 more rows
As with spatial points, we can create a reasonable plot by piping the
state
data in to ggplot
and using
geom_sf
, as we see here:
%>%
state ggplot() +
geom_sf()
It is much more clear with polygons how distorted the default projection makes everything. Here is the plot modified using the projection with CRS 5069.
%>%
state st_transform(5069) %>%
ggplot() +
geom_sf()
We can modify spatial polygons just as with any other dataset and use
geom_sf_text
as with the spatial points. Here, we remove
the Alaska and Hawaii polygons and add the state abbreviations as
labels.
%>%
state filter(!(abb %in% c("AK", "HI"))) %>%
st_transform(5069) %>%
ggplot() +
geom_sf() +
geom_sf_text(aes(label = abb), size = 2)
Representing the spatial regions as polygons is not just useful as a visual aid. We can also compute statistics that describes the regions through numerical summaries. For example, this code inside of a mutate function computes the area of the polygon in square kilometers:
area = as.numeric(st_area(geometry)) / 1e6
Which can use here to compute the smallest and largest states by area:
%>%
state mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
arrange(area)
## Simple feature collection with 50 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -168.1289 ymin: 18.91747 xmax: -66.9824 ymax: 71.29634
## Geodetic CRS: WGS 84
## # A tibble: 50 × 5
## name abb fips geometry area
## <chr> <chr> <chr> <MULTIPOLYGON [°]> <dbl>
## 1 Rhode Island RI 44 (((-71.19564 41.67509, -71.13289 41.660… 2761.
## 2 Delaware DE 10 (((-75.7886 39.7222, -75.61724 39.83423… 5165.
## 3 Connecticut CT 09 (((-73.48731 42.04964, -73.05329 42.039… 12941.
## 4 Hawaii HI 15 (((-156.0572 19.74254, -155.9403 19.852… 15260.
## 5 New Jersey NJ 34 (((-75.50974 39.68611, -75.41506 39.801… 20139.
## 6 Massachusetts MA 25 (((-73.26496 42.74594, -72.45852 42.726… 20600.
## 7 New Hampshire NH 33 (((-71.50109 45.01338, -71.40564 45.198… 23993.
## 8 Vermont VT 50 (((-73.34312 45.01084, -72.84563 45.016… 24791.
## 9 Maryland MD 24 (((-79.47666 39.72108, -77.53488 39.720… 26755.
## 10 West Virginia WV 54 (((-82.59367 38.42181, -82.52958 38.405… 62844.
## # … with 40 more rows
One common technique is to add information to a spatial plot of polygons by using the fill aesthetic. Below, for example, we’ll make the color of each state correspond with its area (usually, we’d use something a bit more interesting, as we will see in the next two notebooks).
%>%
state filter(!(abb %in% c("AK", "HI"))) %>%
mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
st_transform(5069) %>%
ggplot() +
geom_sf(aes(fill = area))
The default color scale is awful in general, and particularly bad for maps. Since we need both the x- and y-aesthetics to represent space, color becomes particularly important with spatial applications. Here is a color scale that I find works very well with spatial data:
scale_fill_distiller( trans = “log2”, palette = “Spectral”, guide = “legend”, n.breaks = 10 )
Which we can see in action here:
%>%
state filter(!(abb %in% c("AK", "HI"))) %>%
mutate(area = as.numeric(st_area(geometry)) / 1e6) %>%
st_transform(5069) %>%
ggplot() +
geom_sf(aes(fill = area)) +
scale_fill_distiller(
trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10
)
We will see more applications of these techniques and more summary statistics, as well as a third type of spatial information (lines), in today’s notebook.
For homework, make your selection of which dataset you want to work with for the final project and be prepared to talk about the challenges (or not) of accessing the data and reading it into R.