library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

Predicting Bike Usage

Today, we will look at a data set of bike sharing usage in Washington, DC. Our task is to predict how many bikes are rented each day. As we saw in the previous set of notes, it will be useful to split the data set into training and validation sets. We will do this right as the data is read into R.

set.seed(1)

bikes <- read_csv("data/bikes.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
bikes
## # A tibble: 731 x 9
##    count season year  workingday weather  temp humidity windspeed train_id
##    <dbl> <chr>  <chr>      <dbl> <chr>   <dbl>    <dbl>     <dbl> <chr>   
##  1   985 Winter one            0 Rain     6.72    0.806     17.3  valid   
##  2   801 Winter one            0 Rain     7.99    0.696     26.8  valid   
##  3  1349 Winter one            1 Dry     -3.04    0.437     26.8  train   
##  4  1562 Winter one            1 Dry     -2.8     0.590     17.3  train   
##  5  1600 Winter one            1 Dry     -1.02    0.437     20.2  train   
##  6  1606 Winter one            1 Dry     -2.51    0.518      9.66 valid   
##  7  1510 Winter one            1 Rain    -3.03    0.499     18.2  train   
##  8   959 Winter one            0 Rain    -5.11    0.536     28.8  valid   
##  9   822 Winter one            0 Dry     -6.87    0.434     39.0  train   
## 10  1321 Winter one            1 Dry     -6.05    0.483     24.1  valid   
## # … with 721 more rows

Non-linear relationships

The data comes from two different years. There was an overall increase in usage during the second year, so it is useful to start by looking at just the second year (we will see how to incorporate both together in just a moment). Here is a plot that investigates the relationship between the maximum daily temperature (degrees Celsius) and the number of bikes that are rented.

bikes %>%
  filter(year == "two") %>%
  ggplot(aes(temp, count)) +
    geom_point()

Generally, people are more willing to rent bikes when the weather is warmer. However, the usage drops a bit when the temperatures are too high. The peak usage visually seems to be right around 20-25°, which is right around the optimal temperature for a bike ride.

We will start by building a linear model between the temperature and the bike usage in year two. Rather than worrying about the RMSE, we will start by just visually inspecting the results.

model <- bikes %>%
  filter(year == "two") %>%
  filter(train_id == "train") %>%
  lm(count ~ temp, data = .)

bikes %>%
  filter(year == "two") %>%
  mutate(score_pred = predict(model, newdata = .)) %>%
  ggplot(aes(temp, count)) +
    geom_point() +
    geom_line(aes(y = score_pred), color = "orange", size = 1.5)