Working with time zones is known to be a particular pain point in data science analyses. We’ll try to illustrate some general points here without making it too complicated.

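All datetime objects held in R have two components: an internal representation of the moment in UTC, and a time zone that determines how that moment is shown as a local time.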

Whenever we use functions such as hour() and isoweek(), R silently converts the UTC time to the object’s time zone and returns results based on the local time.
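As a minimal sketch of these two pieces, the code below builds a single datetime and inspects both the stored number and the attached time zone. The variable names x and x_utc are made up for illustration, and the specific timestamp is just an example.

```r
library(lubridate)

# One moment, created with an explicit local time zone (illustrative value)
x <- ymd_hms("2020-02-01 01:00:00", tz = "America/New_York")

# Internal representation: seconds since 1970-01-01 00:00:00 UTC
as.numeric(x)

# The time zone used for display and for functions such as hour()
tz(x)

# hour() answers in local time; the same moment viewed in UTC gives another hour
x_utc <- with_tz(x, "UTC")
hour(x)      # 1
hour(x_utc)  # 6
```

The stored number does not change when only the time zone changes, which is why x and x_utc compare as equal.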

When creating a datetime object, if we don’t specify the time zone manually, UTC is used for both the internal representation and the local time. When dealing with data in a single time zone, this is actually fine: the displayed times and the computations both use one standard time zone, so everything works as expected. When dealing with data across different time zones, we may need to change the local time associated with a datetime using the function with_tz. Let’s see an example.

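I’ll create a small data set of 8 times. Because we do not set a time zone, make_datetime treats them as UTC. (The six month values are recycled to match the eight hours, which is why the last two rows are back in February.)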

df <- tibble(
  time = make_datetime(2020, c(2, 2, 2, 7, 7, 7), 1, seq_len(8))
)
df
## # A tibble: 8 × 1
##   time               
##   <dttm>             
## 1 2020-02-01 01:00:00
## 2 2020-02-01 02:00:00
## 3 2020-02-01 03:00:00
## 4 2020-07-01 04:00:00
## 5 2020-07-01 05:00:00
## 6 2020-07-01 06:00:00
## 7 2020-02-01 07:00:00
## 8 2020-02-01 08:00:00

We can convert these to the time in New York using the function with_tz:

df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York"))
## # A tibble: 8 × 2
##   time                time_nyc           
##   <dttm>              <dttm>             
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00

These two columns now look different. Notice that the offset differs between February (5 hours) and July (4 hours) because New York observes daylight saving time and UTC does not. If we apply the function hour to each column, we get different outputs:

df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York")) %>%
  mutate(hour = hour(time)) %>%
  mutate(hour_nyc = hour(time_nyc))
## # A tibble: 8 × 4
##   time                time_nyc             hour hour_nyc
##   <dttm>              <dttm>              <int>    <int>
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00     1       20
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00     2       21
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00     3       22
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00     4        0
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00     5        1
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00     6        2
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00     7        2
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00     8        3

However, the actual times are exactly the same. We can see this by subtracting one from the other:

df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York")) %>%
  mutate(diff = time - time_nyc)
## # A tibble: 8 × 3
##   time                time_nyc            diff  
##   <dttm>              <dttm>              <drtn>
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00 0 secs
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00 0 secs
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00 0 secs
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00 0 secs
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00 0 secs
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00 0 secs
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00 0 secs
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00 0 secs

What if the times we input actually were in local NYC time to begin with? As described above, we could just ignore the time zones and do everything in UTC. We can also force the time zone using force_tz:

df %>%
  mutate(time_nyc = force_tz(time, tz = "America/New_York")) %>%
  mutate(diff = time - time_nyc)
## # A tibble: 8 × 3
##   time                time_nyc            diff    
##   <dttm>              <dttm>              <drtn>  
## 1 2020-02-01 01:00:00 2020-02-01 01:00:00 -5 hours
## 2 2020-02-01 02:00:00 2020-02-01 02:00:00 -5 hours
## 3 2020-02-01 03:00:00 2020-02-01 03:00:00 -5 hours
## 4 2020-07-01 04:00:00 2020-07-01 04:00:00 -4 hours
## 5 2020-07-01 05:00:00 2020-07-01 05:00:00 -4 hours
## 6 2020-07-01 06:00:00 2020-07-01 06:00:00 -4 hours
## 7 2020-02-01 07:00:00 2020-02-01 07:00:00 -5 hours
## 8 2020-02-01 08:00:00 2020-02-01 08:00:00 -5 hours

Notice that we have the opposite behavior from before. The times look the same and would return the same values for functions such as hour, but they actually represent different moments in time.
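The contrast between the two functions can be checked on a single value. This is only a sketch: the names utc, shown, moved, and direct are invented here, and the last line assumes the clock reading was meant as New York time when it was first created.

```r
library(lubridate)

utc <- ymd_hms("2020-07-01 04:00:00", tz = "UTC")

# with_tz: same moment, different clock reading
shown <- with_tz(utc, "America/New_York")

# force_tz: same clock reading, different moment
moved <- force_tz(utc, "America/New_York")

utc == shown   # TRUE
utc == moved   # FALSE

# The forced time is 4 hours later than the UTC one (daylight saving in July)
gap <- difftime(moved, utc, units = "hours")
gap

# Passing tz at creation gives the same result as forcing it afterwards
direct <- make_datetime(2020, 7, 1, 4, tz = "America/New_York")
direct == moved  # TRUE
```

If the local time zone is known when the data are created, setting tz up front avoids the need for force_tz later.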

Application

Let’s look one more time at the forecast data you saw in Exam 03. I’ll take a small set of the variables and create a dataset from the hourly forecast.

dt <- read_csv("../data/other_forecast_rva_hourly.csv")

Notice that the times contain the marker ‘-05:00’, indicating that they are recorded in a local time that is 5 hours behind UTC (the offset for the U.S. Eastern time zone during standard time; during daylight saving time it is 4 hours). If we create a datetime object from this, R will apply the offset but will display the result in UTC:

dt %>%
  mutate(
    start_time = ymd_hms(start_time)
  )
## # A tibble: 156 × 2
##    start_time          temperature
##    <dttm>                    <dbl>
##  1 2021-10-20 19:00:00          15
##  2 2021-10-20 20:00:00          16
##  3 2021-10-20 21:00:00          17
##  4 2021-10-20 22:00:00          15
##  5 2021-10-20 23:00:00          14
##  6 2021-10-21 00:00:00          13
##  7 2021-10-21 01:00:00          11
##  8 2021-10-21 02:00:00          10
##  9 2021-10-21 03:00:00           9
## 10 2021-10-21 04:00:00           8
## # … with 146 more rows
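To see the offset handling in isolation, here is a small sketch that parses one timestamp. The string raw is a hypothetical value written in the same style as the start_time column, not a row copied from the file.

```r
library(lubridate)

# Hypothetical timestamp carrying a -05:00 offset
raw <- "2021-10-20T14:00:00-05:00"

# ymd_hms applies the offset and returns the moment in UTC
parsed <- ymd_hms(raw)

tz(parsed)    # "UTC"
hour(parsed)  # 19, i.e. 14 + 5
```

The offset is used only to locate the moment; it is not kept as the time zone of the result.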

Let’s plot the temperature as a function of the time of day. Notice that the highest temperatures occur around 8pm and the lowest around noon. That does not make much sense for local weather, because the times on the axis are in UTC.

dt %>%
  mutate(start_time = ymd_hms(start_time)) %>%
  mutate(time = as_hms(start_time)) %>%
  mutate(wday = wday(start_time, label = TRUE)) %>%
  ggplot(aes(time, temperature)) +
    geom_point(aes(color = wday)) +
    geom_line(aes(color = wday)) +
    scale_x_time()

We can fix this by using the with_tz function before extracting the time of day with as_hms:

dt %>%
  mutate(start_time = ymd_hms(start_time)) %>%
  mutate(start_time = with_tz(start_time, "America/New_York")) %>%
  mutate(time = as_hms(start_time)) %>%
  mutate(wday = wday(start_time, label = TRUE)) %>%
  ggplot(aes(time, temperature)) +
    geom_point(aes(color = wday)) +
    geom_line(aes(color = wday)) +
    scale_x_time()

Now the coldest hours are in the morning and the warmest are in the late afternoon, as we would usually expect.
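The reason the fix works can be sketched on one value: as_hms reads the time of day using the time zone attached to the datetime. The names utc_time and ny_time are made up for this example, and the timestamp is illustrative rather than taken from the data.

```r
library(lubridate)
library(hms)

utc_time <- ymd_hms("2021-10-20 19:00:00", tz = "UTC")
ny_time <- with_tz(utc_time, "America/New_York")

# Same moment, but the time of day depends on the attached time zone
tod_utc <- as_hms(utc_time)
tod_ny <- as_hms(ny_time)

tod_utc
tod_ny
```

Because the conversion happens before as_hms is called, the x-axis of the plot is in New York local time.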

Homework

No homework for today; just get ready for the third exam. It was a short turnaround from the second exam, and I know this is a busy week for everyone.