One of the topics we covered early on in this course was the idea of a data type, which we defined as the type of data stored in a column of a tabular dataset. Most commonly these have been numbers (<int> or <dbl>) or character strings (<chr>). We’ve also come across the logical data type (<lgl>), which can only be either TRUE or FALSE, and factors (<fct>). The latter are similar to character strings but have a built-in ordering of their unique values.
In these notes, we want to consider three new data types specifically designed to work with dates and times. With some reflection, it should seem reasonable to have specific data types for dates and times because they act somewhat like a number (ordered, with some meaning of distance), somewhat like a categorical variable (there are fixed values), and have some unique properties (such as issues with days of the week and timezones).
The three data types we will work with are:
- date (<date>) to represent a single day in time
- datetime (<S3: POSIXct>) to represent a particular time during a particular day
- time (<S3: hms>) to represent a time of day without reference to a specific day
The first two are far more common than the last. Let’s start by looking at a dataset with a date column and discuss unique functions for working with these dates. Then, we’ll see how to extend this to the datetime and time objects.
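To make the three types concrete, here is a minimal sketch that builds one value of each by hand and checks its class. The variable names and the example values are made up for illustration; `ymd` and `ymd_hms` come from lubridate and `as_hms` from the hms package.

```r
library(lubridate)

# One hand-made value of each type (values are illustrative only)
single_day  <- ymd("2013-05-01")                           # date
exact_time  <- ymd_hms("2013-05-01 14:30:00", tz = "UTC")  # datetime
time_of_day <- hms::as_hms("14:30:00")                     # time

# The first element of class() is the type that tibbles abbreviate
class(single_day)
class(exact_time)
class(time_of_day)
```

The time object is stored as a number of seconds since midnight, which is why it carries no information about the day.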
As an example dataset, let’s look at a modified version of the shark data from the first exam. The data is the same, but now we have a specific date column describing the date of the shark attack. Notice that the read_csv function has done the conversion to a date format automatically for us.
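library(tidyverse)  # read_csv, the dplyr verbs, and ggplot2
library(lubridate)  # year(), month(), wday(), and the other date helpers
sharks <- read_csv("../data/shark_attacks_date.csv")
sharks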
## # A tibble: 367 × 11
## date outcome lon lat shark_…¹ shark…² num_s…³ provo…⁴ victi…⁵
## <date> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 1945-02-07 injured 151. -34.0 wobbego… 1.3 1 provok… fishing
## 2 1946-02-10 injured 116. -31.9 white s… 4.2 1 unprov… swimmi…
## 3 1946-08-18 fatal 146. -16.7 tiger s… 4.8 1 unprov… swimmi…
## 4 1947-12-21 injured 153. -31.7 bull sh… 2 1 provok… fishing
## 5 1948-02-12 fatal 152. -32.8 tiger s… 4 1 unprov… swimmi…
## 6 1949-01-13 uninjured 151. -33.7 white s… 3 1 unprov… boardi…
## 7 1949-01-23 fatal 152. -32.9 white s… 4 1 unprov… swimmi…
## 8 1949-04-17 fatal 146. -16.7 tiger s… 3.6 1 unprov… swimmi…
## 9 1949-05-16 injured 122. -18.0 bull sh… 2.7 1 unprov… swimmi…
## 10 1949-11-20 injured 151. -34 wobbego… 2 2 provok… swimmi…
## # … with 357 more rows, 2 more variables: age <dbl>, gender <chr>, and
## # abbreviated variable names ¹shark_type, ²shark_length, ³num_sharks,
## # ⁴provoked, ⁵victim_activity
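The automatic conversion can be tried on a tiny made-up CSV. This is only a sketch: the column names `day`, `moment`, and `clock` are hypothetical, and passing literal text through `I()` assumes readr 2.0 or later.

```r
library(readr)

# I() marks the string as literal CSV content rather than a file path
toy <- read_csv(
  I("day,moment,clock\n2013-05-01,2013-05-01 14:30:00,14:30:00\n"),
  show_col_types = FALSE
)

# First class of each column, as guessed by read_csv
sapply(toy, function(col) class(col)[1])
```

Because each value is written in a standard format, read_csv should guess a date, a datetime, and a time column without any extra arguments.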
There are a number of specific functions that can be used inside the mutate function to extract elements of the date. For example, we can get the year, month, and day of the attack as new columns with the following:
sharks %>%
  mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    wday = wday(date)
  ) %>%
  select(date, year, month, day, wday)
## # A tibble: 367 × 5
## date year month day wday
## <date> <dbl> <dbl> <int> <dbl>
## 1 1945-02-07 1945 2 7 3
## 2 1946-02-10 1946 2 10 7
## 3 1946-08-18 1946 8 18 7
## 4 1947-12-21 1947 12 21 7
## 5 1948-02-12 1948 2 12 4
## 6 1949-01-13 1949 1 13 4
## 7 1949-01-23 1949 1 23 7
## 8 1949-04-17 1949 4 17 7
## 9 1949-05-16 1949 5 16 1
## 10 1949-11-20 1949 11 20 7
## # … with 357 more rows
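The same functions work on a single date outside of a mutate call, which makes them easy to experiment with. The sketch below uses the date from the first row above. One detail worth knowing: the number returned by `wday` depends on which day is treated as the start of the week. lubridate counts from Sunday unless told otherwise, so the 3 shown for a Wednesday in the table is consistent with a session where weeks start on Monday (an inference from the output, not something the data states).

```r
library(lubridate)

attack <- ymd("1945-02-07")  # a Wednesday

year(attack)    # 1945
month(attack)   # 2
day(attack)     # 7

# The weekday number depends on where the week starts
wday(attack, week_start = 1)  # Monday counted as day 1, gives 3
wday(attack, week_start = 7)  # Sunday counted as day 1, gives 4
```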
Note that month and wday have options to return character strings rather than numbers, which can be useful:
sharks %>%
  mutate(
    month = month(date, label = TRUE, abbr = FALSE),
    wday = wday(date, label = TRUE, abbr = FALSE)
  ) %>%
  select(date, month, wday)
## # A tibble: 367 × 3
## date month wday
## <date> <ord> <ord>
## 1 1945-02-07 February Wednesday
## 2 1946-02-10 February Sunday
## 3 1946-08-18 August Sunday
## 4 1947-12-21 December Sunday
## 5 1948-02-12 February Thursday
## 6 1949-01-13 January Thursday
## 7 1949-01-23 January Sunday
## 8 1949-04-17 April Sunday
## 9 1949-05-16 May Monday
## 10 1949-11-20 November Sunday
## # … with 357 more rows
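The <ord> marker in the output means the labels are ordered factors rather than plain strings. A small sketch with three hand-picked dates shows what that buys us; the month names themselves depend on the locale of the R session, so only the ordering is checked here.

```r
library(lubridate)

dates  <- ymd(c("1949-11-20", "1945-02-07", "1948-02-12"))
months <- month(dates, label = TRUE, abbr = TRUE)

is.ordered(months)   # TRUE: an ordered factor, not a character vector
nlevels(months)      # 12: every month is a level, even unobserved ones
sort(months)         # sorted in calendar order, not alphabetically
as.integer(months)   # 11 2 2: the underlying month numbers
```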
The labels that come out of these two functions are stored as ordered factors (<ord>), so they already have a built-in ordering, which makes plots slightly easier to create (though note the comment regarding the line layer below). Here is the number of attacks per month in the sharks data:
sharks %>%
  mutate(
    month = month(date, label = TRUE, abbr = TRUE)
  ) %>%
  group_by(month) %>%
  summarize(n = n()) %>%
  ggplot(aes(month, n)) +
    geom_point() +
    geom_line(aes(group = 1)) # the group = 1 is needed to connect the dots
Using a date object as the x- or y-aesthetic in a plot works without any additional effort on our part. Here, we’ll plot just the attacks in 1990, with date on the x-axis.
sharks %>%
  filter(year(date) %in% c(1990)) %>%
  ggplot(aes(date, shark_type)) +
    geom_point()
Often with dates, it is useful to manually modify the labels on the time-based axis. We do this with scale_x_date (or its y-equivalent if time is on the y-axis). Adding this scale to a plot on its own will have no effect, but we can change the default by changing three parameters:
- date_breaks: a string describing the frequency of the labels, such as “month” or “2 years”
- date_minor_breaks: a string describing the frequency of the grid-lines
- date_labels: a format string for the labels, written with strptime codes
You can set just a subset of these, depending on how much control you want over the plot. For example, let’s label the previous plot by showing a label for every month, and using the pattern “%B” to show (just) the full month’s name.
sharks %>%
  filter(year(date) %in% c(1990)) %>%
  ggplot(aes(date, shark_type)) +
    geom_point() +
    scale_x_date(
      date_breaks = "month",
      date_labels = "%B",
      date_minor_breaks = "month"
    )
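The strptime codes can be tried outside of a plot with base R's `format` function, which is a quick way to preview what a pattern will produce before passing it to date_labels. The date below is made up for illustration, and the codes that produce names (such as “%B” and “%a”) depend on the locale of the R session.

```r
d <- as.Date("1990-03-15")

format(d, "%Y-%m-%d")  # "1990-03-15": four-digit year, month, day
format(d, "%d/%m/%y")  # "15/03/90": day first, two-digit year
format(d, "%B")        # full month name (locale-dependent)
format(d, "%b %Y")     # abbreviated month name, then the year
format(d, "%a")        # abbreviated weekday name (locale-dependent)
```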
To explore datetimes, let’s look at a version of the flights dataset, but this time from the three NYC airports. Reading the data into R shows that one of these columns, time_hour, contains a special object for holding information about the time of the weather reading.
weather <- read_csv("../data/flights_weather.csv")
weather
## # A tibble: 26,110 × 5
## origin time_hour temp wind_speed visib
## <chr> <dttm> <dbl> <dbl> <dbl>
## 1 EWR 2013-01-01 06:00:00 39.0 10.4 10
## 2 EWR 2013-01-01 07:00:00 39.0 8.06 10
## 3 EWR 2013-01-01 08:00:00 39.0 11.5 10
## 4 EWR 2013-01-01 09:00:00 39.9 12.7 10
## 5 EWR 2013-01-01 10:00:00 39.0 12.7 10
## 6 EWR 2013-01-01 11:00:00 37.9 11.5 10
## 7 EWR 2013-01-01 12:00:00 39.0 15.0 10
## 8 EWR 2013-01-01 13:00:00 39.9 10.4 10
## 9 EWR 2013-01-01 14:00:00 39.9 15.0 10
## 10 EWR 2013-01-01 15:00:00 41 13.8 10
## # … with 26,100 more rows
As with the date data type, we can use a variety of functions inside a mutate to extract information about the time. All of the same date functions exist, as well as some extra ones specifically for time.
weather %>%
  mutate(month = month(time_hour), hour = hour(time_hour)) %>%
  select(time_hour, month, hour)
## # A tibble: 26,110 × 3
## time_hour month hour
## <dttm> <dbl> <int>
## 1 2013-01-01 06:00:00 1 6
## 2 2013-01-01 07:00:00 1 7
## 3 2013-01-01 08:00:00 1 8
## 4 2013-01-01 09:00:00 1 9
## 5 2013-01-01 10:00:00 1 10
## 6 2013-01-01 11:00:00 1 11
## 7 2013-01-01 12:00:00 1 12
## 8 2013-01-01 13:00:00 1 13
## 9 2013-01-01 14:00:00 1 14
## 10 2013-01-01 15:00:00 1 15
## # … with 26,100 more rows
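Besides `hour`, lubridate also provides `minute` and `second`. The sketch below applies them to a single made-up reading and also illustrates the timezone issue mentioned at the start of these notes: the same instant has a different hour depending on the timezone it is displayed in. The use of `with_tz` and the choice of timezones are assumptions for illustration and are not part of the weather data.

```r
library(lubridate)

# A made-up reading, recorded in UTC
reading <- ymd_hms("2013-01-01 06:45:30", tz = "UTC")

hour(reading)    # 6
minute(reading)  # 45
second(reading)  # 30

# The same instant shown on a New York clock (five hours behind in January)
local_reading <- with_tz(reading, tzone = "America/New_York")
hour(local_reading)       # 1
reading == local_reading  # TRUE: only the display changed
```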
Also, as with dates, we can set the time variable to the x- or y-aesthetic of a ggplot graphic and it will work as expected. Here is a plot showing the temperature over the course of one day at JFK airport.
weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(month(time_hour) == 5) %>%
  filter(day(time_hour) == 1) %>%
  ggplot(aes(time_hour, temp)) +
    geom_point() +
    geom_line()
We can modify the axis labels with scale_x_datetime. Note that this is a different function than the one required for dates, but the options are the same.
weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(month(time_hour) == 5) %>%
  filter(day(time_hour) == 1) %>%
  ggplot(aes(time_hour, temp)) +
    geom_point() +
    geom_line() +
    scale_x_datetime(
      date_breaks = "2 hours",
      date_labels = "%H"
    )
Here’s another example showing how to label the axis with the days of the week and how to select a week of the year using the isoweek function:
weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(isoweek(time_hour) == 12) %>%
  ggplot(aes(time_hour, temp)) +
    geom_line(aes(color = factor(day(time_hour)))) +
    scale_x_datetime(
      date_breaks = "1 day",
      date_labels = "%a"
    )
We can convert any datetime object to a date object or time object using the functions as_date and as_hms. For example:
weather %>%
  mutate(time_date = as_date(time_hour), time = as_hms(time_hour)) %>%
  select(time_hour, time_date, time)
## # A tibble: 26,110 × 3
## time_hour time_date time
## <dttm> <date> <time>
## 1 2013-01-01 06:00:00 2013-01-01 06:00
## 2 2013-01-01 07:00:00 2013-01-01 07:00
## 3 2013-01-01 08:00:00 2013-01-01 08:00
## 4 2013-01-01 09:00:00 2013-01-01 09:00
## 5 2013-01-01 10:00:00 2013-01-01 10:00
## 6 2013-01-01 11:00:00 2013-01-01 11:00
## 7 2013-01-01 12:00:00 2013-01-01 12:00
## 8 2013-01-01 13:00:00 2013-01-01 13:00
## 9 2013-01-01 14:00:00 2013-01-01 14:00
## 10 2013-01-01 15:00:00 2013-01-01 15:00
## # … with 26,100 more rows
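The two conversions split a datetime into its two halves. Here is a minimal sketch on a single made-up value, calling `as_hms` through the hms package prefix so that the example does not depend on which packages are already loaded:

```r
library(lubridate)

reading <- ymd_hms("2013-05-01 14:30:00", tz = "UTC")

day_part  <- as_date(reading)      # keeps only the calendar day
time_part <- hms::as_hms(reading)  # keeps only the time of day

day_part               # 2013-05-01
time_part              # 14:30:00
as.numeric(time_part)  # 52200: seconds since midnight
```

Neither half can recover the full datetime on its own, so it is usually best to keep the original column and add the converted ones alongside it, as in the mutate call above.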
The function ymd will create a date object from a string. Putting this together with the as_date function, we can simplify the code we used to select a particular day of the year:
weather %>%
  filter(as_date(time_hour) == ymd("2013-05-01"))
## # A tibble: 72 × 5
## origin time_hour temp wind_speed visib
## <chr> <dttm> <dbl> <dbl> <dbl>
## 1 EWR 2013-05-01 00:00:00 57.0 4.60 10
## 2 EWR 2013-05-01 01:00:00 55.9 3.45 10
## 3 EWR 2013-05-01 02:00:00 55.0 3.45 10
## 4 EWR 2013-05-01 03:00:00 54.0 0 10
## 5 EWR 2013-05-01 04:00:00 52.0 0 10
## 6 EWR 2013-05-01 05:00:00 53.1 0 10
## 7 EWR 2013-05-01 06:00:00 48.9 3.45 10
## 8 EWR 2013-05-01 07:00:00 45.0 6.90 10
## 9 EWR 2013-05-01 08:00:00 46.0 8.06 10
## 10 EWR 2013-05-01 09:00:00 44.1 6.90 10
## # … with 62 more rows
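The same filter can be checked on a tiny hand-made table where it is easy to see which rows should survive. The table `toy_weather` and its temperature values are invented for illustration. The sketch also shows that lubridate has sibling functions of `ymd`, such as `mdy` and `dmy`, for strings written in a different order.

```r
library(dplyr)
library(lubridate)

# Four made-up readings that straddle midnight on both sides of May 1st
toy_weather <- tibble(
  time_hour = ymd_hms(c(
    "2013-04-30 23:00:00", "2013-05-01 00:00:00",
    "2013-05-01 13:00:00", "2013-05-02 00:00:00"
  )),
  temp = c(58, 57, 66, 55)
)

one_day <- toy_weather %>%
  filter(as_date(time_hour) == ymd("2013-05-01"))
nrow(one_day)  # 2: only the two readings taken on May 1st remain

# Different string orders, same date
ymd("2013-05-01") == mdy("05/01/2013")  # TRUE
ymd("2013-05-01") == dmy("01-05-2013")  # TRUE
```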
While it is rare to deal with time data without dates in the original data, this comes up frequently when working with certain kinds of analysis. For example, here we can use the as_hms function to show the daily temperature changes over the course of a week:
weather %>%
  filter(origin == "JFK") %>%
  filter(year(time_hour) == 2013) %>%
  filter(isoweek(time_hour) == 12) %>%
  mutate(time = as_hms(time_hour)) %>%
  mutate(day = day(time_hour)) %>%
  ggplot(aes(time, temp)) +
    geom_line(aes(color = factor(day)))
The functionality for time objects is not as good as for date and datetime objects, so try to make use of the latter two whenever possible.
The canonical and best way to store a date object inside of a CSV file is to format the date as “YYYY-MM-DD”. Datetimes can be similarly stored in the format “YYYY-MM-DD HH:MM” or “YYYY-MM-DD HH:MM:SS” and times as “HH:MM” or “HH:MM:SS”. Write down on a piece of paper the following information in the appropriate format.