Today’s lab looks at a data set of flights. Each flight includes an indicator of whether it arrived at least 15 minutes later than expected. I have subset the data so that there are an equal number of delayed and non-delayed flights.

```
set.seed(1)
flights <- read_csv(file.path("data", "flights_small.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
flights
```

```
## # A tibble: 6,000 x 11
##    delayed month   day weekday arr_hour dep_hour origin dest  distance carrier
##      <dbl> <dbl> <dbl>   <dbl>    <dbl>    <dbl> <chr>  <chr>    <dbl> <chr>
##  1       1     3    27       2       14       12 LAS    PHX        256 US
##  2       1     8    18       6       19       17 PHL    MSP        980 NW
##  3       0     3    14       3       21       20 BOS    PHL        280 US
##  4       1     8     9       4       12       11 BOS    LGA        185 DL
##  5       1    11    25       7       22       21 ATL    IAH        689 DL
##  6       0     8     7       2       16       14 LAX    PHX        370 WN
##  7       0     9    30       7       20       19 ATL    SAN       1891 DL
##  8       1    11    12       1       19       17 EWR    ORD        719 UA
##  9       0    12     9       7        8        6 DTW    ATL        594 FL
## 10       1    10     6       6       17       16 SAN    LAS        258 WN
## # … with 5,990 more rows, and 1 more variable: train_id <chr>
```

To start, use a linear regression model to predict whether a flight will be delayed from the variable `dep_hour`, the scheduled (local) hour of departure. Print out a summary of the model.

```
model <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour, data = .)
summary(model)
```

```
##
## Call:
## lm(formula = delayed ~ dep_hour, data = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6758 -0.4756  0.3242  0.4698  0.7428
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.257187   0.025277   10.18   <2e-16 ***
## dep_hour    0.018199   0.001767   10.30   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.493 on 3627 degrees of freedom
## Multiple R-squared: 0.0284, Adjusted R-squared: 0.02813
## F-statistic: 106 on 1 and 3627 DF, p-value: < 2.2e-16
```

Does it appear that flights are more likely to be delayed earlier in the day or later in the day? **Answer** The coefficient is positive, so it seems that flights are more likely to be delayed later in the day (though keep in mind that there may be some oddities for flights scheduled in the early morning).
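As a quick sanity check, we can plug an early and a late departure hour into the fitted coefficients copied from the model summary above (the choice of hours 6 and 20 is just illustrative):

```r
# Coefficients copied from the printed model summary above
intercept <- 0.257187
slope     <- 0.018199

# Predicted probability of a delay for a 6am versus an 8pm departure
p_early <- intercept + slope * 6
p_late  <- intercept + slope * 20
round(c(p_early, p_late), 3)  # 0.366 0.621
```

So the linear model's fitted delay probability rises from roughly 37% early in the morning to over 62% in the evening.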

Using the model you just built, compute the classification rate on the training and validation set.

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = .) > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
```

```
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.581
## 2 valid         0.575
```

How well does the model perform relative to random guessing? **Answer** As the data set is balanced, random guessing would give a classification rate of 0.5. We have a rate of about 0.57, so better than random but still not particularly amazing.
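To see where the 0.5 baseline comes from, here is a small simulation on synthetic data (not the flights set): with a perfectly balanced 0/1 outcome, random guessing gets close to half the labels right.

```r
set.seed(1)
y     <- rep(c(0, 1), each = 5000)   # a balanced 0/1 outcome
guess <- rbinom(length(y), 1, 0.5)   # flip a fair coin for each observation
mean(guess == y)                     # approximately 0.5
```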

Next, build a confusion matrix from the data.

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = .) > 0.5)) %>%
  select(pred, delayed, train_id) %>%
  table()
```

```
## , , train_id = train
##
##     delayed
## pred    0    1
##    0 1045  763
##    1  757 1064
##
## , , train_id = valid
##
##     delayed
## pred   0   1
##    0 707 516
##    1 491 657

Now, build a logistic regression model using the same variable. Print out the model summary again.

```
model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour, data = ., family = binomial())
summary(model)
```

```
##
## Call:
## glm(formula = delayed ~ dep_hour, family = binomial(), data = .)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -1.4932 -1.1354  0.8916  1.1256  1.6165
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.990840   0.105435  -9.398   <2e-16 ***
## dep_hour     0.074267   0.007388  10.052   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 5030.7  on 3628  degrees of freedom
## Residual deviance: 4926.6  on 3627  degrees of freedom
## AIC: 4930.6
##
## Number of Fisher Scoring iterations: 4
```

And compute the classification rate.

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
```

```
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.581
## 2 valid         0.575
```

How does this compare to the linear regression model? **Answer** In this case, the models produce the exact same classification rates. Not surprising given the simplicity of this model.

Next, let’s build two models that use departure hour, distance, and carrier to predict delay. Build one linear model and one logistic model, produce predicted probabilities for both, and plot them against one another as I did in the notes.

```
model_lm <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour + distance + carrier, data = .)

model_glm <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour + distance + carrier, data = ., family = binomial())

flights %>%
  mutate(pred_lm = predict(model_lm, newdata = .)) %>%
  mutate(pred_glm = predict(model_glm, newdata = ., type = "response")) %>%
  ggplot(aes(pred_lm, pred_glm)) +
  geom_point() +
  geom_abline(slope = 1, color = "orange")
```

How would you describe the relationship? **Answer** The two predictions are quite similar, with a few small differences for the most extreme predictions. In those cases the logistic regression predicts a less extreme probability than the linear model.

There are other options for how to relate the product Xb to the response vector y. The “canonical” option is the logistic function, but any function that maps the interval [0, 1] onto the real numbers would work. Another possible choice is an inverse CDF function (if you have not taken probability, do not worry about the details). If we set the family to `binomial(link = "probit")`, the model uses the inverse CDF of a Gaussian/Normal distribution. Setting the link to “cauchit” uses a Cauchy distribution instead.
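The three inverse link functions are all built into base R, so we can compare them directly (a sketch; the evaluation point 3 is arbitrary). All three pass through 0.5 at zero, but the Cauchy distribution has much heavier tails, so its CDF approaches 0 and 1 more slowly:

```r
x <- 3
plogis(x)   # logistic (logit link):  ~0.953
pnorm(x)    # Gaussian (probit link): ~0.999
pcauchy(x)  # Cauchy (cauchit link):  ~0.898
```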

Repeat the code you had above, but use a Cauchy link function in place of the default.

```
model_lm <- flights %>%
  filter(train_id == "train") %>%
  lm(delayed ~ dep_hour + distance + carrier, data = .)

model_glm <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ dep_hour + distance + carrier,
      data = .,
      family = binomial(link = "cauchit"))

flights %>%
  mutate(pred_lm = predict(model_lm, newdata = .)) %>%
  mutate(pred_glm = predict(model_glm, newdata = ., type = "response")) %>%
  ggplot(aes(pred_lm, pred_glm)) +
  geom_point() +
  geom_abline(slope = 1, color = "orange")
```

How does this compare to the original curve? **Answer** While still similar, the Cauchy link predictions differ from the linear regression predictions more strongly, and over the entire range of values.

So far we have used classification rate as our main metric for prediction purposes. This is the one we will use most this semester because it is the most relevant to multi-class prediction tasks, which will be our focus. However, there are other metrics that can be very useful when doing binary classification, particularly when we have a preference for avoiding a certain type of error (falsely classifying 1s as 0s, or vice versa). Assuming that we have two classes named 0 and 1, we can define two quantities:

- Precision: For the observations we predict to be 1, what proportion actually are?
- Recall: What proportion of the total number of 1s did we correctly predict?

Notice that these two quantities are in competition with one another. If we predict that every flight will be delayed, we have perfect recall but very low precision. Likewise, predicting that none of the flights will be delayed gives perfect precision but a recall of 0%. There is another metric, called the F1 score, that balances the two extremes. It is defined as the harmonic mean of the precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)
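As a sketch, these three definitions can be wrapped in a small helper function (the name `prf` and the toy vectors are just illustrative):

```r
# Precision, recall, and F1 from 0/1 vectors of predictions and true labels
prf <- function(pred, truth) {
  tp        <- sum(pred == 1 & truth == 1)   # true positives
  precision <- tp / sum(pred == 1)
  recall    <- tp / sum(truth == 1)
  f1        <- 2 * (precision * recall) / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Toy example: two predicted positives, one of which is correct
prf(pred = c(1, 1, 0, 0), truth = c(1, 0, 1, 0))
# precision 0.5, recall 0.5, f1 0.5
```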

In this section you are going to compute the precision, recall, and F1 score for a specific model. Let’s start by building a logistic regression model using the variables: departure hour, arrival hour, carrier, origin, destination, distance, and weekday. Print out the classification rate so that we have something to compare to.

```
model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ arr_hour +
        dep_hour +
        carrier +
        origin +
        dest +
        distance +
        weekday,
      data = ., family = binomial())

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
```

```
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.619
## 2 valid         0.598
```

Next, using the definitions above, compute the precision, recall, and F1 scores for the training and validation sets. You’ll probably need some of the methods described in the **Using R to Manipulate and Visualize Data** notes linked from the website.

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
```

```
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.623  0.620 0.621
## 2 valid        0.592  0.604 0.598
```

You should see that the model provides a balance of precision and recall; both should be very similar. Now, copy the code from your previous solution, but classify points as delayed if the predicted value is greater than 0.45 (rather than 0.5):

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.45)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
```

```
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.595  0.737 0.659
## 2 valid        0.565  0.717 0.632
```

What happens to the precision, recall, and F1 score? **Answer** The precision drops a bit, the recall increases significantly, and the F1 score increases by a good amount as well.

Now, repeat with the cut-off 0.7:

```
flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.7)) %>%
  group_by(train_id) %>%
  summarize(precision = sum(pred == 1 & delayed == 1) / sum(pred == 1),
            recall = sum(pred == 1 & delayed == 1) / sum(delayed == 1)) %>%
  mutate(f1 = 2 * (precision * recall) / (precision + recall))
```

```
## # A tibble: 2 x 4
##   train_id precision recall    f1
##   <chr>        <dbl>  <dbl> <dbl>
## 1 train        0.722  0.155 0.255
## 2 valid        0.706  0.149 0.246
```

What happens now? **Answer** The precision increases to about 0.7, but the recall drops below 0.15. The F1 score is much lower than before.

Can you think of applications where we would prefer each of these models?

Finally, see if you can improve on the best classification rate by changing the variables. Try adding interactions or polynomial terms. You can also wrap numeric variables in the function `factor()` to convert them to categorical variables if you think it will help.

**Note:** I found the best model to include a factor on month and weekday and polynomial expansion on the arrival and departure hours.

```
model <- flights %>%
  filter(train_id == "train") %>%
  glm(delayed ~ poly(arr_hour, 3) +
        poly(dep_hour, 3) +
        carrier +
        origin +
        dest +
        distance +
        factor(month) +
        factor(weekday),
      data = ., family = binomial())

flights %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.50)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(pred == delayed))
```

```
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.626
## 2 valid         0.620
```