Flight Delays

Today’s lab looks at a data set of flights. Each flight includes an indicator of whether it arrived at least 15 minutes later than scheduled. I have subset the data so that there are an equal number of delayed and non-delayed flights.

set.seed(1)

flights <- read_csv(file.path("data", "flights_small.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
flights
## # A tibble: 6,000 x 11
##    delayed month   day weekday arr_hour dep_hour origin dest  distance carrier
##      <dbl> <dbl> <dbl>   <dbl>    <dbl>    <dbl> <chr>  <chr>    <dbl> <chr>  
##  1       1     3    27       2       14       12 LAS    PHX        256 US     
##  2       1     8    18       6       19       17 PHL    MSP        980 NW     
##  3       0     3    14       3       21       20 BOS    PHL        280 US     
##  4       1     8     9       4       12       11 BOS    LGA        185 DL     
##  5       1    11    25       7       22       21 ATL    IAH        689 DL     
##  6       0     8     7       2       16       14 LAX    PHX        370 WN     
##  7       0     9    30       7       20       19 ATL    SAN       1891 DL     
##  8       1    11    12       1       19       17 EWR    ORD        719 UA     
##  9       0    12     9       7        8        6 DTW    ATL        594 FL     
## 10       1    10     6       6       17       16 SAN    LAS        258 WN     
## # … with 5,990 more rows, and 1 more variable: train_id <chr>

Predicting Delays

To start, using a linear regression model, predict whether a flight will be delayed using the variable dep_hour, the scheduled (local) hour of departure. Print out a summary of the model.

model <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour, data = .)

summary(model)
## 
## Call:
## lm(formula = delayed ~ dep_hour, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6758 -0.4756  0.3242  0.4698  0.7428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.257187   0.025277   10.18   <2e-16 ***
## dep_hour    0.018199   0.001767   10.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.493 on 3627 degrees of freedom
## Multiple R-squared:  0.0284, Adjusted R-squared:  0.02813 
## F-statistic:   106 on 1 and 3627 DF,  p-value: < 2.2e-16

Does it appear that flights are more likely to be delayed earlier in the day or later in the day? Answer The coefficient is positive, so flights appear more likely to be delayed later in the day (keeping in mind that there will be some oddities for departures in the early morning).
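To make the direction concrete, we can plug an early and a late departure hour into the fitted line, using the coefficients printed in the summary above (a quick back-of-the-envelope check, not part of the lab pipeline; the helper name pred_delay is just for illustration):

```r
# Fitted line from the summary above: delayed ≈ 0.2572 + 0.0182 * dep_hour
pred_delay <- function(hour) 0.257187 + 0.018199 * hour

# Expected value of the delay indicator for a 6am versus an 8pm departure
round(c(at_6am = pred_delay(6), at_8pm = pred_delay(20)), 3)
# at_6am at_8pm
#  0.366  0.621
```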

Using the model you just built, compute the classification rate on the training and validation set.

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = .) > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.581
## 2 valid         0.575

How well does the model perform relative to random guessing? Answer As the data set is balanced, random guessing would give a classification rate of 0.5. We have a rate of about 0.57, so better than random but still not particularly amazing.

Next, build a confusion matrix from the data.

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = .) > 0.5)) %>%
  select(pred, delayed, train_id) %>%
  table()
## , , train_id = train
## 
##     delayed
## pred    0    1
##    0 1045  763
##    1  757 1064
## 
## , , train_id = valid
## 
##     delayed
## pred    0    1
##    0  707  516
##    1  491  657

Now, build a logistic regression model using the same variable. Print out the model summary again.

model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour, data = ., family = binomial())

summary(model)
## 
## Call:
## glm(formula = delayed ~ dep_hour, family = binomial(), data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4932  -1.1354   0.8916   1.1256   1.6165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.990840   0.105435  -9.398   <2e-16 ***
## dep_hour     0.074267   0.007388  10.052   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5030.7  on 3628  degrees of freedom
## Residual deviance: 4926.6  on 3627  degrees of freedom
## AIC: 4930.6
## 
## Number of Fisher Scoring iterations: 4

And compute the classification rate.

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.581
## 2 valid         0.575

How does this compare to the linear regression model? Answer In this case, the models produce the exact same classification rates. Not surprising given the simplicity of this model.
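The agreement is not a coincidence. With a single numeric predictor and a monotone link, both models classify a flight as delayed exactly when dep_hour exceeds a single cutoff, and we can recover that cutoff from each printed summary (a quick check using the coefficients shown above):

```r
# lm: the fitted value crosses 0.5 when 0.257187 + 0.018199 * h = 0.5
cutoff_lm <- (0.5 - 0.257187) / 0.018199

# glm: the predicted probability crosses 0.5 when the linear predictor
# crosses 0, i.e. when -0.990840 + 0.074267 * h = 0
cutoff_glm <- 0.990840 / 0.074267

round(c(lm = cutoff_lm, glm = cutoff_glm), 2)  # both about 13.34
```

Since both models draw the decision boundary at (essentially) the same departure hour, they make identical classifications.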

Comparing LM and GLM

Next, let’s build two models that use departure hour, distance, and carrier to predict delay. Build one linear model and one logistic model, produce predicted probabilities for both, and plot them against one another as I did in the notes.

model_lm <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour + distance + carrier, data = .)

model_glm <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour + distance + carrier, data = ., family = binomial())

flights %>%
  mutate(pred_lm = predict(model_lm, newdata = .)) %>%
  mutate(pred_glm = predict(model_glm, newdata = ., type = "response")) %>%
  ggplot(aes(pred_lm, pred_glm)) +
    geom_point() +
    geom_abline(slope = 1, color = "orange")

How would you describe the relationship? Answer The two predictions are quite similar, with a few small differences for the most extreme predictions. In those cases the logistic regression predicts a less extreme probability than the linear model.

There are other options for how to relate the product Xb to the response vector y. The “canonical” option is the logistic function, but any invertible function that maps the interval (0, 1) to the real numbers would work. Another possible choice is an inverse CDF function (if you have not taken probability, do not worry about the details). If we set the family to binomial(link = "probit"), this uses the inverse CDF of a Gaussian/Normal distribution. Setting the link to "cauchit" uses the inverse CDF of a Cauchy distribution.
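A quick way to see how the choices differ is to evaluate the three inverse links directly; R exposes the corresponding CDFs as plogis(), pnorm(), and pcauchy() (a standalone sketch, independent of the flights data):

```r
# Each inverse link maps the linear predictor Xb back into (0, 1). All three
# agree at Xb = 0, but the Cauchy CDF has the heaviest tails, so it approaches
# 0 and 1 most slowly as |Xb| grows.
xb <- c(-3, 0, 3)
round(rbind(logit   = plogis(xb),    # canonical (logistic) link
            probit  = pnorm(xb),     # Gaussian inverse link
            cauchit = pcauchy(xb)),  # Cauchy inverse link
      3)
```

At Xb = 3 the cauchit probability is noticeably further from 1 than the logit or probit probabilities, which is why its predictions look less extreme.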

Repeat the code you had above, but use a Cauchy link function in place of the default.

model_lm <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour + distance + carrier, data = .)

model_glm <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour + distance + carrier,
      data = .,
      family = binomial(link = "cauchit"))

flights %>%
  mutate(pred_lm = predict(model_lm, newdata = .)) %>%
  mutate(pred_glm = predict(model_glm, newdata = ., type = "response")) %>%
  ggplot(aes(pred_lm, pred_glm)) +
    geom_point() +
    geom_abline(slope = 1, color = "orange")

How does this compare to the original curve? Answer While still similar, the Cauchy link function departs from the linear regression predictions more strongly, and over the entire range rather than only at the extremes.

Precision and Recall

So far we have used classification rate as our main metric for prediction purposes. This is the one we will use most this semester because it is the most relevant to multi-class prediction tasks, which will be our focus. However, there are other metrics that can be very useful when doing binary classification, particularly when we have a preference for a certain type of error (falsely classifying 1s as 0s, or vice-versa). Assuming that we have two classes named 0 and 1, we can define two quantities:

  • Precision: For the observations we predict to be 1, what proportion actually are?
  • Recall: What proportion of the total number of 1s did we correctly predict?

Notice that these two quantities are in competition with one another. If we predict that every flight will be delayed, we have perfect recall but very low precision. Likewise, predicting that none of the flights are delayed gives perfect precision but a recall of 0%. There is another metric, called the F1 score, that balances the two extremes. It is defined as the harmonic mean of the precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)
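The definitions translate directly into R. Here is a minimal helper applied to a toy prediction vector (the function name f1_score is just for illustration):

```r
# Precision, recall, and their harmonic mean, following the definitions above
f1_score <- function(pred, truth) {
  tp <- sum(pred == 1 & truth == 1)       # true positives
  precision <- tp / sum(pred == 1)        # of predicted 1s, how many are 1s
  recall    <- tp / sum(truth == 1)       # of actual 1s, how many we found
  c(precision = precision,
    recall    = recall,
    f1        = 2 * (precision * recall) / (precision + recall))
}

# Toy example: 3 predicted positives, 3 actual positives, 2 correct
f1_score(pred = c(1, 1, 0, 0, 1), truth = c(1, 0, 0, 1, 1))
# precision    recall        f1
#     0.667     0.667     0.667
```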

In this section you are going to compute the precision, recall, and F1 score for a specific model. Let’s start by building a logistic regression model using the variables: departure hour, arrival hour, carrier, origin, destination, distance, and weekday. Print out the classification rate so that we have something to compare to.

model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ arr_hour +
                dep_hour +
                carrier +
                origin +
                dest +
                distance +
                weekday,
      data = ., family = binomial())

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.619
## 2 valid         0.598

Next, using the definitions above, compute the precision, recall, and F1 scores for the training and validation sets. You’ll probably need some of the methods described in the Using R to Manipulate and Visualize Data notes linked from the website.

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.623  0.620 0.621
## 2 valid        0.592  0.604 0.598

You should see that the model provides a balance of precision and recall; both should be very similar. Now, copy the code from your previous solution, but classify points as a delay if the predicted value is greater than 0.45 (rather than 0.5):

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.45)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.595  0.737 0.659
## 2 valid        0.565  0.717 0.632

What happens to the precision, recall, and F1 score? Answer The precision drops a bit, the recall increases significantly, and the F1 score increases by a good amount as well.

Now, repeat with the cut-off 0.7:

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.7)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.722  0.155 0.255
## 2 valid        0.706  0.149 0.246

What happens now? Answer The precision increases to about 0.7, but the recall drops to about 0.15. The F1 score is much lower than before.
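The pattern across the three cutoffs is mechanical: raising the threshold can only shrink the set of flights predicted to be delayed, so recall can never increase. A toy illustration with made-up probabilities (not the flights data):

```r
probs <- c(0.2, 0.35, 0.5, 0.65, 0.8)   # hypothetical predicted probabilities
truth <- c(0, 1, 0, 1, 1)               # hypothetical actual delays

# Recall at a given cutoff: proportion of actual 1s that we predict as 1
recall_at <- function(cut) sum(probs > cut & truth == 1) / sum(truth == 1)

c(low_cut = recall_at(0.3), high_cut = recall_at(0.7))
#  low_cut high_cut
#    1.000    0.333
```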

Can you think of applications where we would prefer each of these models?

Your own model

Finally, see if you can improve on the best classification rate by changing the variables. Try to add interactions or polynomial terms. You can also wrap numeric variables in the function factor() to convert them to a categorical variable if you think it will help.

Note: I found the best model included factors on month and weekday and polynomial expansions of the arrival and departure hours.
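If poly() and factor() are unfamiliar, here is a standalone peek at what each produces (independent of the flights data):

```r
# poly() expands a numeric vector into orthogonal polynomial columns; a cubic
# expansion of the 24 possible hours gives a 24 x 3 matrix of features.
dim(poly(0:23, 3))
# [1] 24  3

# factor() turns numeric codes into discrete levels, so the model fits a
# separate coefficient per level instead of a single slope.
nlevels(factor(c(1, 7, 7, 3)))
# [1] 3
```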

model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ poly(arr_hour, 3) +
                poly(dep_hour, 3) +
                carrier +
                origin +
                dest +
                distance +
                factor(month) +
                factor(weekday),
      data = ., family = binomial())

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.50)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.626
## 2 valid         0.620