Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

We also need another package that was not originally installed. Uncomment the following line, run it, and then re-comment it.

#install.packages("lwgeom")

It does not need to be directly loaded, but is needed “under the hood” by some of the other spatial packages.

French COVID-19 Data

Overview

For our next unit, we will be looking at a collection of spatio-temporal data (that is, data with a time component and spatial component) concerning the ongoing COVID-19 pandemic. We will start by looking at data from France, and then move to data from the United States. Today we will introduce tools for working with spatial data. In the next notebook we will see techniques for working with data containing a date variable.

Here are the three French datasets that we will be working with. They contain spatial information, population metadata, and coronavirus numbers at the level of French départements. These are geographic areas that are an important political entities in France. There are 101 départements; 96 in mainland Europe and 5 overseas (called the DOM, or Départements d’outre-mer).

dept <- read_sf(file.path("data", "france_departement.geojson"))
pop <- read_csv(file.path("data", "france_departement_population.csv"))
covid <- read_csv(file.path("data", "france_departement_covid.csv"))

The coronavirus data is stored with one row for each combination of day and département. We have the cumulative number of people who died in each day from COVID-19, the total number currently in hospital, the total number currently in reanimation (this is similar to ICU, but not exactly equivalent, so I used the french term here), and the cumulative number of newly recovered. Notice that deceased and recovered are the total counts of people who have died or recovered, whereas hospitalised and reanimation are the numbers at that moment of patients in each group. There are columns indicating the number of new hospitalisations and reanimations, but these have many missing data points.

covid
## # A tibble: 19,998 x 9
##    date       departement departement_name deceased hospitalised reanimation
##    <date>     <chr>       <chr>               <dbl>        <dbl>       <dbl>
##  1 2020-03-18 01          Ain                     0            2           0
##  2 2020-03-18 02          Aisne                   9            0           0
##  3 2020-03-18 03          Allier                  0            0           0
##  4 2020-03-18 04          Alpes-de-Haute-…        0            3           1
##  5 2020-03-18 05          Hautes-Alpes            0            8           1
##  6 2020-03-18 06          Alpes-Maritimes         2           25           1
##  7 2020-03-18 07          Ardèche                 0            0           0
##  8 2020-03-18 08          Ardennes                0            0           0
##  9 2020-03-18 09          Ariège                  0            1           1
## 10 2020-03-18 10          Aube                    0            5           0
## # … with 19,988 more rows, and 3 more variables: recovered <dbl>,
## #   hospitalised_new <dbl>, reanimation_new <dbl>

Note that, along with date, either the departement or departement_name can be used as a primary key for the data. You only need one to uniquely describe a location.

Unlike the United States, France collects and publishes very little demographic data about its citizens. One of the few variables we will be able to look at for each département is its population, which is in the following table:

pop
## # A tibble: 101 x 2
##    departement population
##    <chr>            <dbl>
##  1 01              643350
##  2 02              534490
##  3 03              337988
##  4 04              163915
##  5 05              141284
##  6 06             1083310
##  7 07              325712
##  8 08              273579
##  9 09              153153
## 10 10              310020
## # … with 91 more rows

When working with the county-level U.S. data for project 3, you will have more demographic variables to work with.

Working with spatial data

We also have loaded spatial data about each département in France in the form of a simple feature collection. The data was loaded from a “geojson” file: a plain-text, open specification for describing spatial data and associated metadata. Printing out the dataset shows that it is not too different from an “ordinary” table of data:

dept
## Simple feature collection with 101 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -61.80958 ymin: -21.38936 xmax: 55.83668 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 101 x 3
##    departement departement_name                                      geometry
##    <chr>       <chr>                                       <MULTIPOLYGON [°]>
##  1 01          Ain                 (((4.78021 46.17668, 4.78024 46.18905, 4.…
##  2 02          Aisne               (((3.17296 50.01131, 3.17382 50.01186, 3.…
##  3 03          Allier              (((3.03207 46.79491, 3.03424 46.7908, 3.0…
##  4 04          Alpes-de-Haute-Pro… (((5.67604 44.19143, 5.67817 44.19051, 5.…
##  5 05          Hautes-Alpes        (((6.26057 45.12685, 6.26417 45.12641, 6.…
##  6 06          Alpes-Maritimes     (((7.06711 43.51365, 7.06665 43.5148, 7.0…
##  7 07          Ardèche             (((4.48313 45.23645, 4.4879 45.23218, 4.4…
##  8 08          Ardennes            (((4.23316 49.95775, 4.2369 49.95858, 4.2…
##  9 09          Ariège              (((1.68842 43.27355, 1.69139 43.27173, 1.…
## 10 10          Aube                (((3.41479 48.39027, 3.41555 48.39373, 3.…
## # … with 91 more rows

Like the pop dataset, there is one row for each département. It has sole extra metadata (printed in a different RStudio window) and a special column called geometry. The geometry holds all of the information indicating where the associate geographic area is on a map.

Most plotting, data manipulation, and modeling functions can be used with a spatial data frame just the same way we used plain data frame. For example, we can do a left join with the population data and slice off the first 96 rows (these are the areas that are in Europe).

dept %>%
  left_join(pop, by = "departement") %>%
  slice(1:96)
## Simple feature collection with 96 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -5.14026 ymin: 41.33363 xmax: 9.55996 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 96 x 4
##    departement departement_name                           geometry population
##    <chr>       <chr>                            <MULTIPOLYGON [°]>      <dbl>
##  1 01          Ain                (((4.78021 46.17668, 4.78024 46…     643350
##  2 02          Aisne              (((3.17296 50.01131, 3.17382 50…     534490
##  3 03          Allier             (((3.03207 46.79491, 3.03424 46…     337988
##  4 04          Alpes-de-Haute-Pr… (((5.67604 44.19143, 5.67817 44…     163915
##  5 05          Hautes-Alpes       (((6.26057 45.12685, 6.26417 45…     141284
##  6 06          Alpes-Maritimes    (((7.06711 43.51365, 7.06665 43…    1083310
##  7 07          Ardèche            (((4.48313 45.23645, 4.4879 45.…     325712
##  8 08          Ardennes           (((4.23316 49.95775, 4.2369 49.…     273579
##  9 09          Ariège             (((1.68842 43.27355, 1.69139 43…     153153
## 10 10          Aube               (((3.41479 48.39027, 3.41555 48…     310020
## # … with 86 more rows

Notice that the spatial components of the data frame are still present after joining and slicing the data.

Standard ggplot functions work to visualise the non-spatial components of our spatial data. To show the spatial component we need to use a unique kind of geometry called geom_sf. It will plot the shapes in the dataset (by default from the geometry column) over a map. Here is an example of France using our spatial data:

dept %>%
  slice(1:96) %>%
  ggplot() +
    geom_sf()

We can control the way the map looks by adjusting the aesthetics, just as with any other geometry:

  • color (border color)
  • fill (interior color)
  • size (width of the border)
  • alpha (transparency of the shapes)

Here, we will make the borders very small and show the overall population of each département:

dept %>%
  left_join(pop, by = "departement") %>%
  slice(1:96) %>%
  ggplot() +
    geom_sf(aes(fill = population), color = "black", size = 0.1) +
    scale_fill_viridis_c()