Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

#install.packages("RcppRoll")

French COVID-19

Load the Data

We will be using the same datasets from the previous notebooks:

dept <- read_sf(file.path("data", "france_departement.geojson"))
pop <- read_csv(file.path("data", "france_departement_population.csv"))
covid <- read_csv(file.path("data", "france_departement_covid.csv"))

This time, we will look into the temporal components of the data and see how they can be integrated into the spatial visualisations.

Plotting Dates

Notice that the date variable in the covid dataset has a special data type called “date”. On some reflection, it seems reasonable to have a separate data type for dates because they act somewhat like a number (they are ordered and have a meaningful distance between them) and somewhat like a categorical variable (they take a fixed set of values). In this section we will explore how to work with these data objects.

covid
## # A tibble: 19,998 x 9
##    date       departement departement_name deceased hospitalised reanimation
##    <date>     <chr>       <chr>               <dbl>        <dbl>       <dbl>
##  1 2020-03-18 01          Ain                     0            2           0
##  2 2020-03-18 02          Aisne                   9            0           0
##  3 2020-03-18 03          Allier                  0            0           0
##  4 2020-03-18 04          Alpes-de-Haute-…        0            3           1
##  5 2020-03-18 05          Hautes-Alpes            0            8           1
##  6 2020-03-18 06          Alpes-Maritimes         2           25           1
##  7 2020-03-18 07          Ardèche                 0            0           0
##  8 2020-03-18 08          Ardennes                0            0           0
##  9 2020-03-18 09          Ariège                  0            1           1
## 10 2020-03-18 10          Aube                    0            5           0
## # … with 19,988 more rows, and 3 more variables: recovered <dbl>,
## #   hospitalised_new <dbl>, reanimation_new <dbl>
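The number-like behaviour of dates can be seen without any of the course data. The following is a minimal base-R sketch; the two dates `d1` and `d2` are made up for illustration and are not taken from a particular row.

```r
# Two illustrative dates, created with base R's as.Date()
d1 <- as.Date("2020-03-18")
d2 <- as.Date("2020-03-27")

class(d1)        # "Date"
d1 < d2          # TRUE: dates are ordered, so comparisons work
as.numeric(d1)   # internally, a count of days since 1970-01-01
range(c(d2, d1)) # earliest and latest date, as with numbers
```

Because dates are stored as day counts, sorting, filtering, and plotting them along an axis all behave as they would for an ordinary numeric variable.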

We have already worked a bit with date objects in plotting. When used as the x-axis (or less commonly, the y-axis), a date object will display sensibly with no additional work on our part.

covid %>%
  filter(departement == "69") %>%  # departement is a character column
  ggplot(aes(date, deceased)) +
    geom_line()

Often with dates, it is useful to manually modify the labels on the time-based axis. We do this with scale_x_date (or scale_y_date if time is on the y-axis). Adding this scale to a plot with no arguments has no effect, but we can change the defaults by setting three parameters:

  • date_breaks: a string describing the frequency of the labels, such as “month” or “2 years”
  • date_minor_breaks: a string describing the frequency of the minor grid-lines
  • date_labels: a format string for the labels, using the codes documented in ?strptime (for example, “%B” for the full month name)

You can set just a subset of these, depending on how much control you want over the plot. For example, let’s label the previous plot by showing a label for every month, and using the pattern “%B” to show (just) the full month’s name.

covid %>%
  filter(departement == "69") %>%  # departement is a character column
  ggplot(aes(date, deceased)) +
    geom_line() +
    scale_x_date(date_breaks = "month", date_labels = "%B", date_minor_breaks = "month")
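The format codes accepted by date_labels are the same ones used by base R's format function, so it can be convenient to try a pattern on a single date before putting it in a plot. A small sketch, using an arbitrary example date that is not tied to the data:

```r
d <- as.Date("2020-07-14")   # arbitrary example date

format(d, "%Y-%m-%d")   # "2020-07-14"
format(d, "%d/%m/%y")   # "14/07/20"
format(d, "%B %Y")      # full month name and year; the language follows the locale
format(d, "%b %d")      # abbreviated month name and day of the month
```

The purely numeric patterns give the same result everywhere, while patterns that contain names (such as “%B” or “%b”) depend on the language settings of the R session.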

The R programming language is designed to be used in English, but often the output of plots will be distributed in other languages. You will note in the previous plot that the month names are in English. How do we change this? We can change the locale used within R by using the Sys.setlocale function. This data comes from France, so let’s change our locale to French:

Sys.setlocale(locale = "fr_FR.UTF-8")
## [1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8"

Running the plot again shows the month names in French:

covid %>%
  filter(departement == "69") %>%  # departement is a character column
  ggplot(aes(date, deceased)) +
    geom_line() +
    scale_x_date(date_breaks = "month", date_labels = "%B")

For the rest of the notebook, we will move back to English:

Sys.setlocale(locale = "en_US.UTF-8")
## [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/fr_FR.UTF-8"

You will note that I actually included this line in the header of the notebook to make sure that the notebook is running in an English/US locale. On RStudio Cloud that should be the default, but on your personal machine this may be different.
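If you only want to change how dates are printed, one option is to change just the “LC_TIME” category and restore the previous value afterwards. The sketch below assumes that the “fr_FR.UTF-8” locale is installed on your machine; locale names differ between operating systems, so treat the name as an assumption rather than something guaranteed to work.

```r
old_time_locale <- Sys.getlocale("LC_TIME")   # remember the current setting

# Assumption: "fr_FR.UTF-8" is available; if it is not, R issues a warning
# (suppressed here) and the date format below stays in the old language.
suppressWarnings(Sys.setlocale("LC_TIME", "fr_FR.UTF-8"))
format(as.Date("2020-07-14"), "%A %d %B")     # day and month names in the active locale

Sys.setlocale("LC_TIME", old_time_locale)     # restore the original setting
```

Saving the old value first means the notebook returns to whatever locale it started with, rather than to a hard-coded one.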

Working with date variables

There are a number of functions that allow us to manipulate date objects, most frequently to extract a particular element of the date and save it as a new variable within a call to mutate. For example, year(), month() and day() return a numeric version of the corresponding date components:

covid %>%
  mutate(year = year(date), month = month(date), day = day(date)) %>%
  select(date, year, month, day) %>%
  unique()
## # A tibble: 198 x 4
##    date        year month   day
##    <date>     <dbl> <dbl> <int>
##  1 2020-03-18  2020     3    18
##  2 2020-03-19  2020     3    19
##  3 2020-03-20  2020     3    20
##  4 2020-03-21  2020     3    21
##  5 2020-03-22  2020     3    22
##  6 2020-03-23  2020     3    23
##  7 2020-03-24  2020     3    24
##  8 2020-03-25  2020     3    25
##  9 2020-03-26  2020     3    26
## 10 2020-03-27  2020     3    27
## # … with 188 more rows

We can also do mathematical operations on dates; adding or subtracting an integer shifts the date by the specified number of days:

covid %>%
  mutate(date_next = date + 2) %>%
  select(date, date_next) %>%
  unique()
## # A tibble: 198 x 2
##    date       date_next 
##    <date>     <date>    
##  1 2020-03-18 2020-03-20
##  2 2020-03-19 2020-03-21
##  3 2020-03-20 2020-03-22
##  4 2020-03-21 2020-03-23
##  5 2020-03-22 2020-03-24
##  6 2020-03-23 2020-03-25
##  7 2020-03-24 2020-03-26
##  8 2020-03-25 2020-03-27
##  9 2020-03-26 2020-03-28
## 10 2020-03-27 2020-03-29
## # … with 188 more rows
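Dates can also be subtracted from one another and used to build sequences. The following base-R sketch uses the first and tenth dates from the print-out above; the variable names are only for illustration.

```r
start <- as.Date("2020-03-18")
end   <- as.Date("2020-03-27")

start + 2                     # two days later: 2020-03-20
gap <- end - start            # a "difftime" object measured in days
as.numeric(gap)               # 9
all_days <- seq(start, end, by = "day")   # every date from start to end, inclusive
length(all_days)              # 10
```

Wrapping the difference in as.numeric is useful when the number of elapsed days is needed as an ordinary number, for example as a “days since the first observation” variable.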

It can also be useful to create a temporary date object, for example to select a subset of days. This is done with one of two functions. The first is the ymd function, which takes a single string formatted as “YYYY-MM-DD”:

covid %>%
  filter(between(date, ymd("2020-07-02"), ymd("2020-07-07"))) %>%
  select(date) %>%
  unique()
## # A tibble: 6 x 1
##   date      
##   <date>    
## 1 2020-07-02
## 2 2020-07-03
## 3 2020-07-04
## 4 2020-07-05
## 5 2020-07-06
## 6 2020-07-07

The second is the make_date function, which takes separate numeric arguments for the year, month, and day. Note that the year, month, and day default to 1970, 1, and 1, respectively, so make_date(2020, 07) below is the first of July.

covid %>%
  filter(between(date, make_date(2020, 07), make_date(2020, 07, 15))) %>%
  select(date) %>%
  unique()
## # A tibble: 15 x 1
##    date      
##    <date>    
##  1 2020-07-01
##  2 2020-07-02
##  3 2020-07-03
##  4 2020-07-04
##  5 2020-07-05
##  6 2020-07-06
##  7 2020-07-07
##  8 2020-07-08
##  9 2020-07-09
## 10 2020-07-10
## 11 2020-07-11
## 12 2020-07-12
## 13 2020-07-13
## 14 2020-07-14
## 15 2020-07-15

The functions make_date and ymd can also be used within a mutate function to create a date object when it is not automatically constructed. Some of you did this for Project 1, for example.
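As a sketch of that use inside mutate, suppose a table stores the date as separate pieces. The table `raw` and its column names below are hypothetical and are not one of the course datasets.

```r
library(dplyr)
library(lubridate)

# Hypothetical table in which the date arrives as separate pieces
raw <- tibble(
  year = c(2020, 2020),
  month = c(3, 7),
  day = c(18, 14),
  date_string = c("2020-03-18", "2020-07-14")
)

dated <- raw %>%
  mutate(
    date_from_parts = make_date(year, month, day),  # built from numeric columns
    date_from_string = ymd(date_string)             # parsed from a "YYYY-MM-DD" string
  )
dated
```

Both new columns have the date type, so they can be used with the plotting and filtering code shown earlier.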

Another useful function is isoweek, which returns the number of the week within the year using the ISO 8601 convention, in which weeks start on a Monday.

covid %>%
  mutate(week = isoweek(date)) %>%
  select(date, week) %>%
  unique()
## # A tibble: 198 x 2
##    date        week
##    <date>     <dbl>
##  1 2020-03-18    12
##  2 2020-03-19    12
##  3 2020-03-20    12
##  4 2020-03-21    12
##  5 2020-03-22    12
##  6 2020-03-23    13
##  7 2020-03-24    13
##  8 2020-03-25    13
##  9 2020-03-26    13
## 10 2020-03-27    13
## # … with 188 more rows
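One thing to keep in mind when dates cross a new year is that an ISO week can belong to a different year than the calendar date. The rows printed above all fall within 2020, so this is only a side note; the date below is chosen purely to illustrate the point.

```r
library(lubridate)

new_year <- ymd("2021-01-01")   # a Friday

isoweek(new_year)   # 53: still part of the last ISO week of 2020
isoyear(new_year)   # 2020: the ISO year that week belongs to
year(new_year)      # 2021: the calendar year
```

When grouping by week over several years, pairing isoweek with isoyear (rather than year) keeps the days of such a week together.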

And wday returns the day of the week. Setting the options label = TRUE and abbr = FALSE returns these as an ordered factor:

covid %>%
  mutate(wday = wday(date, label = TRUE, abbr = FALSE)) %>%
  select(date, wday) %>%
  unique()
## # A tibble: 198 x 2
##    date       wday     
##    <date>     <ord>    
##  1 2020-03-18 Wednesday
##  2 2020-03-19 Thursday 
##  3 2020-03-20 Friday   
##  4 2020-03-21 Saturday 
##  5 2020-03-22 Sunday   
##  6 2020-03-23 Monday   
##  7 2020-03-24 Tuesday  
##  8 2020-03-25 Wednesday
##  9 2020-03-26 Thursday 
## 10 2020-03-27 Friday   
## # … with 188 more rows

Note that the same options (label and abbr) also exist for the month() function. Also, we can set the locale of wday within the function itself like this:

covid %>%
  mutate(wday = wday(date, locale = "fr_FR.UTF-8", label = TRUE, abbr = FALSE)) %>%
  select(date, wday) %>%
  unique()
## # A tibble: 198 x 2
##    date       wday    
##    <date>     <ord>   
##  1 2020-03-18 Mercredi
##  2 2020-03-19 Jeudi   
##  3 2020-03-20 Vendredi
##  4 2020-03-21 Samedi  
##  5 2020-03-22 Dimanche
##  6 2020-03-23 Lundi   
##  7 2020-03-24 Mardi   
##  8 2020-03-25 Mercredi
##  9 2020-03-26 Jeudi   
## 10 2020-03-27 Vendredi
## # … with 188 more rows

Otherwise, it uses the setting from Sys.setlocale. Note that setting the option inside the function does not change the locale for subsequent plots or functions.
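Without label = TRUE, wday returns a number, and which day counts as 1 depends on the week_start argument. A short sketch using the first date in the data:

```r
library(lubridate)

d <- ymd("2020-03-18")     # a Wednesday

wday(d, week_start = 7)    # 4: weeks start on Sunday (the usual lubridate default)
wday(d, week_start = 1)    # 3: weeks start on Monday
```

Stating week_start explicitly avoids surprises if the default has been changed through the lubridate.week.start option.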

Window Functions

Finally, we will look at a completely different way to work with time data. Rather than working directly with a date variable, we focus on working with an ordered set of data and associating each row with the rows before and after it. To start, let’s grab just a single département from the data, make sure that it is ordered by the date variable, and select just the variables date and hospitalised.

covid_paris <- covid %>%
  filter(departement_name == "Paris") %>%
  arrange(date) %>%
  select(date, hosp = hospitalised) # shortened name to better display below
covid_paris
## # A tibble: 198 x 2
##    date        hosp
##    <date>     <dbl>
##  1 2020-03-18   359
##  2 2020-03-19   453
##  3 2020-03-20   575
##  4 2020-03-21   649
##  5 2020-03-22   728
##  6 2020-03-23   925
##  7 2020-03-24  1076
##  8 2020-03-25  1289
##  9 2020-03-26  1493
## 10 2020-03-27  1656
## # … with 188 more rows

The function lead lets us associate a row in the data set with the value of a variable in the following row of data:

covid_paris %>%
  mutate(hosp_next = lead(hosp))
## # A tibble: 198 x 3
##    date        hosp hosp_next
##    <date>     <dbl>     <dbl>
##  1 2020-03-18   359       453
##  2 2020-03-19   453       575
##  3 2020-03-20   575       649
##  4 2020-03-21   649       728
##  5 2020-03-22   728       925
##  6 2020-03-23   925      1076
##  7 2020-03-24  1076      1289
##  8 2020-03-25  1289      1493
##  9 2020-03-26  1493      1656
## 10 2020-03-27  1656      1927
## # … with 188 more rows

By default lead looks one row down, but we can set the option n to select a different offset. For example, looking a week into the future:

covid_paris %>%
  mutate(hosp_next = lead(hosp), hosp_next_week = lead(hosp, n = 7))
## # A tibble: 198 x 4
##    date        hosp hosp_next hosp_next_week
##    <date>     <dbl>     <dbl>          <dbl>
##  1 2020-03-18   359       453           1289
##  2 2020-03-19   453       575           1493
##  3 2020-03-20   575       649           1656
##  4 2020-03-21   649       728           1927
##  5 2020-03-22   728       925           2115
##  6 2020-03-23   925      1076           2217
##  7 2020-03-24  1076      1289           2434
##  8 2020-03-25  1289      1493           2633
##  9 2020-03-26  1493      1656           2838
## 10 2020-03-27  1656      1927           2897
## # … with 188 more rows

The function lag works exactly the same way, but associates a row with the value of a variable in a preceding row. Note that the first row of data has a missing value because there is no previous row; we can change that with the option default. The same thing happens at the end of the data when using lead; it just was not visible in the print-out.

covid_paris %>%
  mutate(hosp_last = lag(hosp), hosp_last_default = lag(hosp, default = 0))
## # A tibble: 198 x 4
##    date        hosp hosp_last hosp_last_default
##    <date>     <dbl>     <dbl>             <dbl>
##  1 2020-03-18   359        NA                 0
##  2 2020-03-19   453       359               359
##  3 2020-03-20   575       453               453
##  4 2020-03-21   649       575               575
##  5 2020-03-22   728       649               649
##  6 2020-03-23   925       728               728
##  7 2020-03-24  1076       925               925
##  8 2020-03-25  1289      1076              1076
##  9 2020-03-26  1493      1289              1289
## 10 2020-03-27  1656      1493              1493
## # … with 188 more rows
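A common use of lag is to compute the change from one row to the next. The sketch below rebuilds the first five Paris rows from the print-out above so that it can be run on its own; the names `paris_head` and `hosp_change` are only for illustration.

```r
library(dplyr)

# First five Paris values, copied from the print-out above
paris_head <- tibble(
  date = as.Date("2020-03-18") + 0:4,
  hosp = c(359, 453, 575, 649, 728)
)

paris_change <- paris_head %>%
  mutate(hosp_change = hosp - lag(hosp))   # NA, 94, 122, 74, 79
paris_change
```

When working with the full covid table instead of a single département, it would be sensible to arrange by date and group_by(departement) before calling lag, so that a row is never paired with a row from a different département.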

Other functions exist to compute more complex relationships between rows. One that will be helpful for us here is roll_meanr from the RcppRoll package, which takes the rolling average of a variable over a fixed number of values going back in the table. For example, setting n = 2 averages the hospitalization value in each row with the value in the preceding row:

covid_paris %>%
  mutate(h_mean = roll_meanr(hosp, n = 2))
## # A tibble: 198 x 3
##    date        hosp h_mean
##    <date>     <dbl>  <dbl>
##  1 2020-03-18   359    NA 
##  2 2020-03-19   453   406 
##  3 2020-03-20   575   514 
##  4 2020-03-21   649   612 
##  5 2020-03-22   728   688.
##  6 2020-03-23   925   826.
##  7 2020-03-24  1076  1000.
##  8 2020-03-25  1289  1182.
##  9 2020-03-26  1493  1391 
## 10 2020-03-27  1656  1574.
## # … with 188 more rows

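There are also the variations roll_meanl, which averages each value with the values that follow it, and roll_mean, which by default centers the window symmetrically on either side. The r (right-aligned) version is the most appropriate for many kinds of time-series data because each average uses only the current and earlier observations.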
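To make the alignment of the window concrete, here is a sketch that reproduces the idea in base R with stats::filter rather than RcppRoll. It is a stand-in for illustration, not how roll_meanr is implemented, and it uses the first five Paris values from the tables above.

```r
hosp <- c(359, 453, 575, 649, 728)   # first five Paris values from above

# sides = 1: each average uses the current value and the one before it
trailing <- as.numeric(stats::filter(hosp, rep(1 / 2, 2), sides = 1))
trailing   # NA 406.0 514.0 612.0 688.5

# sides = 2: a three-value window centered on each row
centered <- as.numeric(stats::filter(hosp, rep(1 / 3, 3), sides = 2))
centered   # NA at both ends, where a neighbour is missing
```

The trailing version matches the h_mean column shown earlier, with the missing value at the start; the centered version has missing values at both ends because it needs a neighbour on each side.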

We can see the effect of rolling averages in the following plot:

covid_paris %>%
  mutate(
    h_mean_7 = roll_meanr(hosp, n = 7),   # 7-day trailing average
    h_mean_30 = roll_meanr(hosp, n = 30)  # 30-day trailing average
  ) %>%
  ggplot(aes(x = date, y = hosp)) +
    geom_line() +
    geom_line(aes(y = h_mean_7), linetype = "dashed") +
    geom_line(aes(y = h_mean_30), linetype = "dotted")