Working with time zones is known to be a particular pain point in data science analyses. We’ll try to illustrate some general points here without making it too complicated.
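All datetime objects held in R have two components: the instant itself, which is stored internally in UTC, and a time zone that determines how that instant is shown as a local time.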
Whenever we use functions such as hour() and isoweek(), R will silently convert the UTC time to the specific time zone and return results based on the local time.
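As a minimal sketch of this behavior, the snippet below builds one instant and attaches two different time zones to it. The variable names x and x_nyc are made up for illustration; it assumes the lubridate package is loaded. The internal value is identical, while the local-time helpers give different answers.

```r
library(lubridate)

# One instant, stored internally as seconds since 1970-01-01 00:00:00 UTC.
x <- ymd_hms("2020-02-01 01:00:00", tz = "UTC")

# The same instant, with a different time zone attached for display.
x_nyc <- with_tz(x, tzone = "America/New_York")

as.numeric(x) == as.numeric(x_nyc)  # TRUE: the internal value is unchanged
tz(x)        # "UTC"
tz(x_nyc)    # "America/New_York"
hour(x)      # 1
hour(x_nyc)  # 20
```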
When creating a datetime object, if we don’t specify the time zone manually, R assumes that we want to use UTC as the local time as well as the internal representation of the time. When dealing with data in a single time zone, this is actually fine: both the dates and the computations occur in one standard time zone, so everything works as expected. When dealing with data across different time zones, we may need to change the local time associated with a datetime using the function with_tz. Let’s see an example.
I’ll create a small data set of 8 times. Without setting the time zone, R assumes that these are in UTC.
df <- tibble(
  time = make_datetime(2020, c(2, 2, 2, 7, 7, 7), 1, seq_len(8))
)
df
## # A tibble: 8 × 1
## time
## <dttm>
## 1 2020-02-01 01:00:00
## 2 2020-02-01 02:00:00
## 3 2020-02-01 03:00:00
## 4 2020-07-01 04:00:00
## 5 2020-07-01 05:00:00
## 6 2020-07-01 06:00:00
## 7 2020-02-01 07:00:00
## 8 2020-02-01 08:00:00
We can convert these to the time in New York using the function with_tz:
df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York"))
## # A tibble: 8 × 2
## time time_nyc
## <dttm> <dttm>
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00
These two columns now look different. Notice that the offset between them is 5 hours in February but 4 hours in July because of daylight saving time (UTC does not observe daylight saving time). If we apply the function hour to each column, we get different outputs:
df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York")) %>%
  mutate(hour = hour(time)) %>%
  mutate(hour_nyc = hour(time_nyc))
## # A tibble: 8 × 4
## time time_nyc hour hour_nyc
## <dttm> <dttm> <int> <int>
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00 1 20
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00 2 21
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00 3 22
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00 4 0
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00 5 1
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00 6 2
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00 7 2
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00 8 3
However, the actual times are exactly the same. We can see this by subtracting one from the other:
df %>%
  mutate(time_nyc = with_tz(time, tz = "America/New_York")) %>%
  mutate(diff = time - time_nyc)
## # A tibble: 8 × 3
## time time_nyc diff
## <dttm> <dttm> <drtn>
## 1 2020-02-01 01:00:00 2020-01-31 20:00:00 0 secs
## 2 2020-02-01 02:00:00 2020-01-31 21:00:00 0 secs
## 3 2020-02-01 03:00:00 2020-01-31 22:00:00 0 secs
## 4 2020-07-01 04:00:00 2020-07-01 00:00:00 0 secs
## 5 2020-07-01 05:00:00 2020-07-01 01:00:00 0 secs
## 6 2020-07-01 06:00:00 2020-07-01 02:00:00 0 secs
## 7 2020-02-01 07:00:00 2020-02-01 02:00:00 0 secs
## 8 2020-02-01 08:00:00 2020-02-01 03:00:00 0 secs
What if the times we input actually were in local NYC time to begin with? As described above, we could just ignore the time zones and do everything in UTC. We can also force the time zone using force_tz:
df %>%
  mutate(time_nyc = force_tz(time, tz = "America/New_York")) %>%
  mutate(diff = time - time_nyc)
## # A tibble: 8 × 3
## time time_nyc diff
## <dttm> <dttm> <drtn>
## 1 2020-02-01 01:00:00 2020-02-01 01:00:00 -5 hours
## 2 2020-02-01 02:00:00 2020-02-01 02:00:00 -5 hours
## 3 2020-02-01 03:00:00 2020-02-01 03:00:00 -5 hours
## 4 2020-07-01 04:00:00 2020-07-01 04:00:00 -4 hours
## 5 2020-07-01 05:00:00 2020-07-01 05:00:00 -4 hours
## 6 2020-07-01 06:00:00 2020-07-01 06:00:00 -4 hours
## 7 2020-02-01 07:00:00 2020-02-01 07:00:00 -5 hours
## 8 2020-02-01 08:00:00 2020-02-01 08:00:00 -5 hours
Notice that we have the opposite behavior from before. The times look the same and return the same values for functions such as hour, but they are actually different instants.
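To make the contrast concrete, here is a small sketch on a single value. The variable names are hypothetical, and it assumes lubridate is loaded. It also shows an alternative that may be simpler when we know the zone up front: passing tz directly to make_datetime, which yields the same instant as forcing the zone afterwards.

```r
library(lubridate)

# 04:00 on 1 July 2020, interpreted as UTC (the default).
utc_time <- make_datetime(2020, 7, 1, 4)

# Keep the clock reading (04:00) but reinterpret it as New York local time.
forced <- force_tz(utc_time, tzone = "America/New_York")

# Build the New York local time directly instead of forcing it afterwards.
direct <- make_datetime(2020, 7, 1, 4, tz = "America/New_York")

forced == direct                # TRUE: both describe the same instant
hour(forced) == hour(utc_time)  # TRUE: the clock reading did not change
utc_time - forced               # a difference of -4 hours, as in the July rows above
```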
Let’s look one more time at the forecast data you saw in Exam 03. I’ll take a small set of the variables and create a dataset from the hourly forecast.
dt <- read_csv("../data/other_forecast_rva_hourly.csv")
dt
Notice that the times contain the marker ‘-05:00’, indicating that these times are in a local time zone that is 5 hours behind UTC (that’s the offset for U.S. Eastern Standard Time; during daylight saving time the offset is ‘-04:00’). If we create a datetime object from this, R will apply the offset but will display the result in UTC:
dt %>%
  mutate(
    start_time = ymd_hms(start_time)
  )
## # A tibble: 156 × 2
## start_time temperature
## <dttm> <dbl>
## 1 2021-10-20 19:00:00 15
## 2 2021-10-20 20:00:00 16
## 3 2021-10-20 21:00:00 17
## 4 2021-10-20 22:00:00 15
## 5 2021-10-20 23:00:00 14
## 6 2021-10-21 00:00:00 13
## 7 2021-10-21 01:00:00 11
## 8 2021-10-21 02:00:00 10
## 9 2021-10-21 03:00:00 9
## 10 2021-10-21 04:00:00 8
## # … with 146 more rows
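To see the offset handling in isolation, here is a minimal sketch on a single made-up timestamp string. The string is hypothetical and is not taken from the forecast file; the sketch assumes lubridate is loaded. The offset is used to compute the instant and is then dropped, so the result is reported in UTC until we attach a local time zone ourselves.

```r
library(lubridate)

# Hypothetical timestamp string with an explicit UTC offset.
x <- ymd_hms("2021-01-15T14:00:00-05:00")

tz(x)    # "UTC": the offset was applied, then discarded
hour(x)  # 19

# Attach a local time zone to recover the local clock reading.
x_local <- with_tz(x, tzone = "America/New_York")
hour(x_local)  # 14
x == x_local   # TRUE: still the same instant
```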
Let’s plot the temperature as a function of the time of day. Notice that the highest temperatures occur around 8pm and the lowest around noon. That does not make much sense.
dt %>%
  mutate(start_time = ymd_hms(start_time)) %>%
  mutate(time = as_hms(start_time)) %>%
  mutate(wday = wday(start_time, label = TRUE)) %>%
  ggplot(aes(time, temperature)) +
    geom_point(aes(color = wday)) +
    geom_line(aes(color = wday)) +
    scale_x_time()
We can fix this by applying the with_tz function before creating the time-of-day variable:
dt %>%
  mutate(start_time = ymd_hms(start_time)) %>%
  mutate(start_time = with_tz(start_time, "America/New_York")) %>%
  mutate(time = as_hms(start_time)) %>%
  mutate(wday = wday(start_time, label = TRUE)) %>%
  ggplot(aes(time, temperature)) +
    geom_point(aes(color = wday)) +
    geom_line(aes(color = wday)) +
    scale_x_time()
Now the coldest hours are in the morning and the warmest are in the late afternoon, as we would usually expect.
No homework for today; just get ready for the third exam. It was a short turnaround from the second exam, and I know this is a busy week for everyone.