One of the topics we covered early on in this course was the idea of a data type, which we defined as the type of data stored in a column of a tabular dataset. Most commonly these have been numbers (<int> or <dbl>) or character strings (<chr>). We’ve also come across the logical data type (<lgl>), which can only be either TRUE or FALSE, and factors (<fct>). The latter are similar to character strings but have a built-in ordering of their unique values.

In these notes, we want to consider three new data types specifically designed to work with dates and times. With some reflection, it should seem reasonable to have specific data types for dates and times because they act somewhat like a number (they are ordered and have a meaningful notion of distance), somewhat like a categorical variable (there are fixed values), and have some unique properties (such as issues with days of the week and time zones).

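The three data types we will work with are:

  - date (<date>): a calendar day, with no time of day attached
  - datetime (<dttm>): a specific moment, combining a date and a time of day
  - time (<time>): a time of day, with no date attached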

The first two are far more common than the last. Let’s start by looking at a dataset with a date column and discuss unique functions for working with these dates. Then, we’ll see how to extend this to the datetime and time objects.
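Before turning to real data, here is a minimal sketch that builds one value of each type (date, datetime, and time) by hand. The values are made up for illustration, and we assume the lubridate and hms packages are installed.

```r
library(lubridate)  # ymd(), ymd_hms()
library(hms)        # as_hms()

# One illustrative value of each type (the values themselves are made up).
a_date     <- ymd("2013-05-01")
a_datetime <- ymd_hms("2013-05-01 06:30:00", tz = "UTC")
a_time     <- as_hms("06:30:00")

class(a_date)      # "Date"
class(a_datetime)  # "POSIXct" "POSIXt"
class(a_time)      # "hms" "difftime"

# Dates behave a bit like numbers: subtraction gives a distance in days.
days_apart <- ymd("2013-05-08") - a_date
days_apart
```

The last line shows the "somewhat like a number" behavior: the difference between two dates is a length of time, here seven days.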

Dates

As an example dataset, let’s look at a modified version of the shark data from the first exam. The data is the same, but now we have a specific date column describing the date of the shark attack. Notice that the read_csv function has done the conversion to a date format automatically for us.

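library(tidyverse)  # read_csv, mutate, ggplot, and the pipe
library(lubridate)  # date helpers such as year, month, wday, and ymd
sharks <- read_csv("../data/shark_attacks_date.csv")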
sharks
## # A tibble: 367 × 11
##    date       outcome     lon   lat shark_…¹ shark…² num_s…³ provo…⁴ victi…⁵
##    <date>     <chr>     <dbl> <dbl> <chr>      <dbl>   <dbl> <chr>   <chr>  
##  1 1945-02-07 injured    151. -34.0 wobbego…     1.3       1 provok… fishing
##  2 1946-02-10 injured    116. -31.9 white s…     4.2       1 unprov… swimmi…
##  3 1946-08-18 fatal      146. -16.7 tiger s…     4.8       1 unprov… swimmi…
##  4 1947-12-21 injured    153. -31.7 bull sh…     2         1 provok… fishing
##  5 1948-02-12 fatal      152. -32.8 tiger s…     4         1 unprov… swimmi…
##  6 1949-01-13 uninjured  151. -33.7 white s…     3         1 unprov… boardi…
##  7 1949-01-23 fatal      152. -32.9 white s…     4         1 unprov… swimmi…
##  8 1949-04-17 fatal      146. -16.7 tiger s…     3.6       1 unprov… swimmi…
##  9 1949-05-16 injured    122. -18.0 bull sh…     2.7       1 unprov… swimmi…
## 10 1949-11-20 injured    151. -34   wobbego…     2         2 provok… swimmi…
## # … with 357 more rows, 2 more variables: age <dbl>, gender <chr>, and
## #   abbreviated variable names ¹​shark_type, ²​shark_length, ³​num_sharks,
## #   ⁴​provoked, ⁵​victim_activity

Several functions can be used inside the mutate function to extract components of a date. For example, we can get the year, month, day, and day of the week of each attack as new columns with the following:

sharks %>%
  mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    wday = wday(date)
  ) %>%
  select(date, year, month, day, wday)
## # A tibble: 367 × 5
##    date        year month   day  wday
##    <date>     <dbl> <dbl> <int> <dbl>
##  1 1945-02-07  1945     2     7     3
##  2 1946-02-10  1946     2    10     7
##  3 1946-08-18  1946     8    18     7
##  4 1947-12-21  1947    12    21     7
##  5 1948-02-12  1948     2    12     4
##  6 1949-01-13  1949     1    13     4
##  7 1949-01-23  1949     1    23     7
##  8 1949-04-17  1949     4    17     7
##  9 1949-05-16  1949     5    16     1
## 10 1949-11-20  1949    11    20     7
## # … with 357 more rows
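The numbering of the weekdays depends on which day is treated as the start of the week. In the output above Wednesday is 3 and Sunday is 7, which matches a week that starts on Monday. The sketch below, assuming the lubridate package, uses the week_start argument of wday to make that choice explicit rather than relying on a session-wide setting.

```r
library(lubridate)

d <- ymd("1945-02-07")  # a Wednesday, the first date in the output above

monday_first <- wday(d, week_start = 1)  # 3: Monday is counted as day 1
sunday_first <- wday(d, week_start = 7)  # 4: Sunday is counted as day 1

monday_first
sunday_first
```

Without week_start, lubridate falls back on the lubridate.week.start option and, if that is unset, starts the week on Sunday.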

Note that month and wday have options to return names rather than numbers, which can be useful:

sharks %>%
  mutate(
    month = month(date, label = TRUE, abbr = FALSE),
    wday = wday(date, label = TRUE, abbr = FALSE)
  ) %>%
  select(date, month, wday)
## # A tibble: 367 × 3
##    date       month    wday     
##    <date>     <ord>    <ord>    
##  1 1945-02-07 February Wednesday
##  2 1946-02-10 February Sunday   
##  3 1946-08-18 August   Sunday   
##  4 1947-12-21 December Sunday   
##  5 1948-02-12 February Thursday 
##  6 1949-01-13 January  Thursday 
##  7 1949-01-23 January  Sunday   
##  8 1949-04-17 April    Sunday   
##  9 1949-05-16 May      Monday   
## 10 1949-11-20 November Sunday   
## # … with 357 more rows

The labels that come out of these two functions are ordered factors (<ord>), so they already have a built-in ordering, which makes plots slightly easier to create (though note the comment regarding the line layer below). Here is the number of attacks per month in the sharks data:

sharks %>%
  mutate(
    month = month(date, label = TRUE, abbr = TRUE)
  ) %>%
  group_by(month) %>%
  summarize(n = n()) %>%
  ggplot(aes(month, n)) +
    geom_point() +
    geom_line(aes(group = 1))  # the group = 1 is needed to connect the dots
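To see why the built-in ordering matters, here is a small sketch, assuming the lubridate package and three dates picked from the sharks output for illustration. It compares sorting the month labels as an ordered factor with sorting them as plain character strings. The label text depends on the locale; the comments assume an English one.

```r
library(lubridate)

dates <- ymd(c("1949-11-20", "1945-02-07", "1946-08-18"))
m <- month(dates, label = TRUE, abbr = TRUE)

is.ordered(m)          # TRUE: the labels are an ordered factor

# Sorting the factor follows the calendar (Feb, Aug, Nov in English).
calendar_order <- sort(m)
as.character(calendar_order)

# Sorting plain strings is alphabetical instead (Aug, Feb, Nov in English).
sort(as.character(m))
```

Because the x-axis is a factor, ggplot treats each month as its own group, which is why the line layer needs group = 1 to connect the points.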

Using a date object as the x- or y-aesthetic in a plot works without any additional effort on our part. Here, we’ll plot just the attacks in 1990, with date on the x-axis.

sharks %>%
  filter(year(date) %in% c(1990)) %>%
  ggplot(aes(date, shark_type)) +
    geom_point()

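Often with dates, it is useful to manually modify the labels on the time-based axis. We do this with scale_x_date (or scale_y_date if time is on the y-axis). Adding this scale to a plot on its own will have no effect, but we can change the defaults by setting three parameters:

  - date_breaks: the spacing between labeled breaks, given as a string such as "month" or "2 hours"
  - date_labels: a pattern describing how each label is printed, such as "%B"
  - date_minor_breaks: the spacing between the minor grid lines, given in the same way as date_breaks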

You can set just a subset of these, depending on how much control you want over the plot. For example, let’s label the previous plot by showing a label for every month, and using the pattern “%B” to show (just) the full name of the month.

sharks %>%
  filter(year(date) %in% c(1990)) %>%
  ggplot(aes(date, shark_type)) +
    geom_point() +
    scale_x_date(
      date_breaks = "month",
      date_labels = "%B",
      date_minor_breaks = "month"
    )
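The date_labels patterns use the same codes as base R's format function, so a quick way to try out a pattern is to format a single date. This is only a sketch with a made-up date. The month names depend on the locale, so only the numeric results are shown in the comments.

```r
d <- as.Date("1990-03-05")

format(d, "%B")        # full month name
format(d, "%b")        # abbreviated month name
format(d, "%d %b %Y")  # day, abbreviated month, four-digit year

iso_label  <- format(d, "%Y-%m-%d")  # "1990-03-05"
year_label <- format(d, "%Y")        # "1990"
iso_label
year_label
```

Any pattern that gives the label you want here can be passed as date_labels in scale_x_date.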

Datetimes

To explore datetimes, let’s look at the weather data that accompanies the flights dataset, recorded at the three NYC airports. Reading the data into R shows that one of the columns, time_hour, contains a special object for holding information about the time of the weather reading.

weather <- read_csv("../data/flights_weather.csv")
weather
## # A tibble: 26,110 × 5
##    origin time_hour            temp wind_speed visib
##    <chr>  <dttm>              <dbl>      <dbl> <dbl>
##  1 EWR    2013-01-01 06:00:00  39.0      10.4     10
##  2 EWR    2013-01-01 07:00:00  39.0       8.06    10
##  3 EWR    2013-01-01 08:00:00  39.0      11.5     10
##  4 EWR    2013-01-01 09:00:00  39.9      12.7     10
##  5 EWR    2013-01-01 10:00:00  39.0      12.7     10
##  6 EWR    2013-01-01 11:00:00  37.9      11.5     10
##  7 EWR    2013-01-01 12:00:00  39.0      15.0     10
##  8 EWR    2013-01-01 13:00:00  39.9      10.4     10
##  9 EWR    2013-01-01 14:00:00  39.9      15.0     10
## 10 EWR    2013-01-01 15:00:00  41        13.8     10
## # … with 26,100 more rows

As with the date data type, we can use a variety of functions inside a mutate to extract information about the time. All of the same date functions exist, as well as some extra ones specifically for time.

weather %>%
  mutate(month = month(time_hour), hour = hour(time_hour)) %>%
  select(time_hour, month, hour)
## # A tibble: 26,110 × 3
##    time_hour           month  hour
##    <dttm>              <dbl> <int>
##  1 2013-01-01 06:00:00     1     6
##  2 2013-01-01 07:00:00     1     7
##  3 2013-01-01 08:00:00     1     8
##  4 2013-01-01 09:00:00     1     9
##  5 2013-01-01 10:00:00     1    10
##  6 2013-01-01 11:00:00     1    11
##  7 2013-01-01 12:00:00     1    12
##  8 2013-01-01 13:00:00     1    13
##  9 2013-01-01 14:00:00     1    14
## 10 2013-01-01 15:00:00     1    15
## # … with 26,100 more rows

Also, as with dates, we can set the time variable to the x- or y-aesthetic of a ggplot graphic and it will work as expected. Here is a plot showing the temperature over the course of one day at JFK airport.

weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(month(time_hour) == 5) %>%
  filter(day(time_hour) == 1) %>%
  ggplot(aes(time_hour, temp)) +
    geom_point() +
    geom_line()

We can modify the axis labels with scale_x_datetime. Note that this is a different function than the one required for dates, but the options are the same.

weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(month(time_hour) == 5) %>%
  filter(day(time_hour) == 1) %>%
  ggplot(aes(time_hour, temp)) +
    geom_point() +
    geom_line()  +
    scale_x_datetime(
      date_breaks = "2 hours",
      date_labels = "%H"
    )
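To get a feel for what date_breaks = "2 hours" and date_labels = "%H" produce, we can build a similar sequence by hand in base R. This is only an approximation of what ggplot does internally, using a made-up start time in UTC.

```r
# Midnight on the day plotted above, in UTC for simplicity (an assumption).
start <- as.POSIXct("2013-05-01 00:00:00", tz = "UTC")

# Twelve time points spaced two hours apart.
breaks <- seq(start, by = "2 hours", length.out = 12)

hour_labels <- format(breaks, "%H")  # "00" "02" "04" ... "22"
hour_labels

format(breaks, "%H:%M")              # hours and minutes, e.g. "06:00"
```

The "%H" code gives the two-digit hour on a 24-hour clock, and "%M" adds the minutes.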

Here’s another example showing how to label the axis with the days of the week and how to select a week of the year using the isoweek function:

weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(isoweek(time_hour) == 12) %>%
  ggplot(aes(time_hour, temp)) +
    geom_line(aes(color = factor(day(time_hour))))  +
    scale_x_datetime(
      date_breaks = "1 day",
      date_labels = "%a"
    )
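ISO weeks run from Monday to Sunday, so it can help to check which calendar days a given week number covers. Here is a small sketch, assuming the lubridate package, that looks at the days around week 12 of 2013.

```r
library(lubridate)

days <- ymd(c("2013-03-17", "2013-03-18", "2013-03-24", "2013-03-25"))

weeks <- isoweek(days)  # 11 12 12 13
weeks

# Confirm that week 12 begins on a Monday and ends on a Sunday.
wday(days, week_start = 1)  # 7 1 7 1
```

So filtering on isoweek(time_hour) == 12 keeps Monday 18 March through Sunday 24 March.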

Time and Conversion

We can convert any datetime object to a date object or time object using the functions as_date and as_hms (the latter comes from the hms package). For example:

weather %>%
  mutate(time_date = as_date(time_hour), time = as_hms(time_hour)) %>%
  select(time_hour, time_date, time)
## # A tibble: 26,110 × 3
##    time_hour           time_date  time  
##    <dttm>              <date>     <time>
##  1 2013-01-01 06:00:00 2013-01-01 06:00 
##  2 2013-01-01 07:00:00 2013-01-01 07:00 
##  3 2013-01-01 08:00:00 2013-01-01 08:00 
##  4 2013-01-01 09:00:00 2013-01-01 09:00 
##  5 2013-01-01 10:00:00 2013-01-01 10:00 
##  6 2013-01-01 11:00:00 2013-01-01 11:00 
##  7 2013-01-01 12:00:00 2013-01-01 12:00 
##  8 2013-01-01 13:00:00 2013-01-01 13:00 
##  9 2013-01-01 14:00:00 2013-01-01 14:00 
## 10 2013-01-01 15:00:00 2013-01-01 15:00 
## # … with 26,100 more rows

The function ymd will create a date object from a string. Putting this together with the as_date function, we can simplify the code we used to select a particular day of the year:

weather %>%
  filter(as_date(time_hour) == ymd("2013-05-01"))
## # A tibble: 72 × 5
##    origin time_hour            temp wind_speed visib
##    <chr>  <dttm>              <dbl>      <dbl> <dbl>
##  1 EWR    2013-05-01 00:00:00  57.0       4.60    10
##  2 EWR    2013-05-01 01:00:00  55.9       3.45    10
##  3 EWR    2013-05-01 02:00:00  55.0       3.45    10
##  4 EWR    2013-05-01 03:00:00  54.0       0       10
##  5 EWR    2013-05-01 04:00:00  52.0       0       10
##  6 EWR    2013-05-01 05:00:00  53.1       0       10
##  7 EWR    2013-05-01 06:00:00  48.9       3.45    10
##  8 EWR    2013-05-01 07:00:00  45.0       6.90    10
##  9 EWR    2013-05-01 08:00:00  46.0       8.06    10
## 10 EWR    2013-05-01 09:00:00  44.1       6.90    10
## # … with 62 more rows
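The ymd function has siblings for other orderings of year, month, and day. The sketch below, assuming the lubridate package and made-up input strings, shows that they all produce the same date object, and that as_date drops the time of day from a datetime.

```r
library(lubridate)

target <- ymd("2013-05-01")

# Different orderings in the input string, same resulting date.
same_mdy <- mdy("05/01/2013") == target  # TRUE
same_dmy <- dmy("01-05-2013") == target  # TRUE

# A datetime late in the evening still belongs to the same calendar day.
stamp <- ymd_hms("2013-05-01 23:00:00", tz = "UTC")
same_day <- as_date(stamp) == target     # TRUE

c(same_mdy, same_dmy, same_day)
```

The comparison is made in the time zone stored with the datetime, which is UTC in this sketch.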

While it is rare to find times without dates in the original data, time-only values come up frequently in certain kinds of analysis. For example, here we can use the as_hms function to show the daily temperature changes over the course of a week:

weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(isoweek(time_hour) == 12) %>%
  mutate(time = as_hms(time_hour)) %>%
  mutate(day = day(time_hour)) %>%
  ggplot(aes(time, temp)) +
    geom_line(aes(color = factor(day)))
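The reason this plot overlays the days is that as_hms throws away the date, so readings taken at the same hour on different days get the same x-value. A minimal sketch, assuming the lubridate and hms packages and two made-up time stamps:

```r
library(lubridate)
library(hms)

# The same time of day on two consecutive days.
stamps <- ymd_hms(c("2013-03-18 06:00:00", "2013-03-19 06:00:00"), tz = "UTC")

times <- as_hms(stamps)
times                  # 06:00:00 for both

# A time is stored as the number of seconds since midnight.
as.numeric(times)      # 21600 21600
```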

The functionality for time objects is not as good as for date and datetime objects, so try to make use of the latter two whenever possible.

Homework

The canonical and best way to store a date object inside of a CSV file is to format the date as “YYYY-MM-DD”. Datetimes can be similarly stored in the format “YYYY-MM-DD HH:MM” or “YYYY-MM-DD HH:MM:SS” and times as “HH:MM” or “HH:MM:SS”. Write down on a piece of paper the following information in the appropriate format.

  1. When you woke up this morning.
  2. The first day of this semester.
  3. When you were born.
  4. When your alarm is currently set for.
  5. The date of your next birthday.
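If you want to check your answers, one option is to see whether R can parse what you wrote. The sketch below assumes the lubridate and hms packages and uses made-up values, one for each of the formats described above.

```r
library(lubridate)  # ymd(), ymd_hm(), ymd_hms()
library(hms)        # parse_hm(), parse_hms()

# Made-up example values, one per format.
a_date   <- ymd("2013-05-01")               # YYYY-MM-DD
a_dttm   <- ymd_hm("2013-05-01 06:30")      # YYYY-MM-DD HH:MM
a_dttm_s <- ymd_hms("2013-05-01 06:30:15")  # YYYY-MM-DD HH:MM:SS
a_time   <- parse_hm("06:30")               # HH:MM
a_time_s <- parse_hms("06:30:15")           # HH:MM:SS

# Text that is not a date at all becomes NA (with a warning).
not_a_date <- suppressWarnings(ymd("yesterday"))
is.na(not_a_date)  # TRUE
```

Keep in mind that these parsers are forgiving about separators, so a successful parse does not by itself prove the string is in the canonical format.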