library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
Today, we will look at a data set of bike sharing usage in Washington, DC. Our task is to predict how many bikes are rented each day. As we saw in the previous set of notes, it will be useful to split the data set into training and validation sets. We will do this right as the data is read into R.
set.seed(1)
<- read_csv("data/bikes.csv") %>%
bikes mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
bikes
## # A tibble: 731 x 9
## count season year workingday weather temp humidity windspeed train_id
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 985 Winter one 0 Rain 6.72 0.806 17.3 valid
## 2 801 Winter one 0 Rain 7.99 0.696 26.8 valid
## 3 1349 Winter one 1 Dry -3.04 0.437 26.8 train
## 4 1562 Winter one 1 Dry -2.8 0.590 17.3 train
## 5 1600 Winter one 1 Dry -1.02 0.437 20.2 train
## 6 1606 Winter one 1 Dry -2.51 0.518 9.66 valid
## 7 1510 Winter one 1 Rain -3.03 0.499 18.2 train
## 8 959 Winter one 0 Rain -5.11 0.536 28.8 valid
## 9 822 Winter one 0 Dry -6.87 0.434 39.0 train
## 10 1321 Winter one 1 Dry -6.05 0.483 24.1 valid
## # … with 721 more rows
The data comes from two different years. There was an overall increase in usage during the second year, so it is useful to start by looking at just the second year (we will see how to incorporate both together in just a moment). Here is a plot that investigates the relationship between the maximum daily temperature (degrees Celsius) and the number of bikes that are rented.
%>%
bikes filter(year == "two") %>%
ggplot(aes(temp, count)) +
geom_point()
Generally, people are more willing to rent bikes when the weather is warmer. However, the usage drops a bit when the temperatures are too high. The peak usage visually seems to be right around 20-25°, which is right around the optimal temperature for a bike ride.
We will start by building a linear model between the temperature and the bike usage in year two. Rather than worrying about the RMSE, we will start by just visually inspecting the results.
<- bikes %>%
model filter(year == "two") %>%
filter(train_id == "train") %>%
lm(count ~ temp, data = .)
%>%
bikes filter(year == "two") %>%
mutate(score_pred = predict(model, newdata = .)) %>%
ggplot(aes(temp, count)) +
geom_point() +
geom_line(aes(y = score_pred), color = "orange", size = 1.5)