Class 03: The Lady Tasting Tea

2017-09-05

library(readr)
library(ggplot2)
library(dplyr)
library(viridis)

Tea Reviews

Today we are going to fill in some of the gaps left from the last class about building exploratory graphics and linear models. The title for this class comes from an experiment described in the 1935 text The Design of Experiments by Sir Ronald Fisher, one of the founding fathers of modern statistics.

We are going to look at a different dataset: tea reviews from the Adagio Tea website. I collected this dataset about 12 months ago, so it should be similar, but not identical, to what is on the site today. Let’s read the data into R from my website, as we have done with the mammals and abalone datasets:

tea <- read_csv("https://statsmaths.github.io/ml_data/tea.csv")

Opening the data in the data viewer, we see the required first two columns and the response of interest: the average score assigned to the tea by customers.

The variables available to predict the output are the type of tea, the number of reviews received, and the price of the tea. The latter is given in estimated cents per cup, as reported on the site. We also have the full name of the tea.

Exploratory analysis

Univariate plots

Before doing anything else, let’s try to understand the distribution of each of the variables.

qplot(score, data = tea)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


The score values are generally very high, with most of them above 88. All of the scores are integers, and the most common values fall between 92 and 95.
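
If you would rather check these claims numerically than read them off the histogram, a quick summary of the column works (a small aside, not part of the original notes):

summary(tea$score)
table(tea$score)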

qplot(price, data = tea)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


The price variable is heavily skewed, with a few very expensive teas. Most are well under a quarter per cup.

qplot(num_reviews, data = tea)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


The number of reviews also has a bit of skew, but not nearly as strong as the price variable.

qplot(type, data = tea)


There are twelve types of tea, with some having only a few samples and others having over thirty.
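
To see the exact counts behind this histogram, we can tally the types directly (a quick sketch using the count function from dplyr, which is already loaded):

count(tea, type, sort = TRUE)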

Bivariate plots

Now, we can proceed to bivariate plots showing the relationship between each variable and the response.

qplot(num_reviews, score, data = tea)


There seems to be a slight positive relationship between the number of reviews and the score.

qplot(price, score, data = tea)


Given the skew of the price variable, it is hard to see the exact relationship between price and score.
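
One way around the skew, as a quick sketch beyond the original analysis, is to put the price on a logarithmic scale before looking at the relationship:

qplot(price, score, data = tea) + scale_x_log10()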

Linear Models

Last time I used the geom_smooth function to put what I called a best-fit line onto the plot. We then used the lm function to estimate the slope and intercept of this line numerically. Let’s try to formalize this concept now.

The classical linear regression model assumes that the average value, or mean, of the response Y is a linear function of X. Symbolically, with the index i representing the i’th sample, this gives:
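
$$ \text{mean}(Y_i) = \alpha + \beta \cdot X_i $$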

Similarly, we can write that Y is equal to a fixed linear effect dependent on X plus a random variable epsilon with zero mean.
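
$$ Y_i = \alpha + \beta \cdot X_i + \epsilon_i, \qquad \text{mean}(\epsilon_i) = 0 $$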

The estimation task here is to find reasonable estimates for the alpha and beta components given pairs of observations (X, Y). There are many ways of doing this but by far the most common is to use what is known as Ordinary Least Squares or OLS. This selects the alpha and beta that minimize the squared errors implied by the linear model. As before, let’s write this down symbolically:
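
$$ \widehat{\alpha}, \widehat{\beta} = \arg\min_{a, b} \left\{ \sum_i \left( y_i - a - b \cdot x_i \right)^2 \right\} $$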

A natural question to ask is why we are interested in the squared errors. We could just as easily ask for the minimizer of the absolute values of the errors or the minimizer of the maximum error.

I can offer two justifications of the squared error, one numerical and one statistical. Unlike the absolute errors or maximum error, the squared error produces a smooth objective function; that is, a function that has an infinite number of derivatives at every point (although all derivatives beyond the second are equal to zero). This makes it much easier to find the minimizing values. Secondly, if the errors are distributed as a normal random variable, the OLS estimator is equivalent to the maximum likelihood estimator (MLE).
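
For reference (this derivation is not spelled out in the original notes), setting the partial derivatives of the sum of squared errors to zero and solving gives the familiar closed-form OLS estimates:

$$ \widehat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \widehat{\alpha} = \bar{y} - \widehat{\beta} \cdot \bar{x} $$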

Linear Models - Visually

In order to better understand linear models, it helps to see a picture. Below I have drawn a line through our dataset and indicated the errors (also known as residuals) that ordinary least squares is concerned with minimizing. Note: don’t worry too much about the code producing this graphic; concentrate just on the output for now.

tea$score_pred <- 89 + tea$num_reviews * 0.002
qplot(num_reviews, score, data = tea) +
  geom_line(aes(num_reviews, score_pred), color = "orange") +
  geom_segment(aes(xend = num_reviews, yend = score_pred), alpha = 0.5)


Notice that this line under-predicts the scores of most teas, particularly when the number of reviews is low.

Brute Force

How can we figure out what the best alpha and beta are for this model? The most straightforward way would be to just try a large number of values and see which combination minimizes the sum of squared errors. This is largely impractical in general, but useful to see in this small example. It will also let you see some more involved R code.

To start, we construct variables to hold the best sum of squares, alpha, and beta values. We then create a double loop cycling over all combinations of alpha and beta and testing whether the given configuration gives a better sum of squares compared to our current best value.

best_ss <- Inf
best_a <- Inf
best_b <- Inf
for (a in seq(75, 120)) {
  for (b in seq(0.0001, 0.01, by = 0.0001)) {
    tea$score_pred <- a + tea$num_reviews * b
    ss <- sum((tea$score_pred - tea$score)^2, na.rm = TRUE)
    if (ss < best_ss) {
      best_a <- a
      best_b <- b
      best_ss <- ss
    }
  }
}
sprintf("best sum of squares=%03.02f at alpha=%03.01f and beta=%0.03f",
        best_ss, best_a, best_b)
## [1] "best sum of squares=660.63 at alpha=92.0 and beta=0.001"

Using lm

Now, let’s compare the brute force algorithm with the lm function:

model <- lm(score ~ num_reviews, data = tea)
model
## 
## Call:
## lm(formula = score ~ num_reviews, data = tea)
## 
## Coefficients:
## (Intercept)  num_reviews  
##   9.225e+01    9.527e-04

Our brute-force solution is actually quite close to this! Of course, it depended on me knowing approximately what the right slope and intercept should be. In this case that was easy to see graphically, but in more complex examples it is a much more challenging task.
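
As a small aside (not in the original notes), if you want the estimated intercept and slope as plain numeric values rather than reading them off the printed output, the coef function extracts them:

coef(model)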

Let’s use the predict function again to compute the fitted values from this model on our data:

tea$score_pred <- predict(model, newdata = tea)

And now plot the fitted values. Does it visually correspond to where you would expect the best fit line to run?

qplot(num_reviews, score, data = tea) +
  geom_point(aes(num_reviews, score_pred), color = "orange")


Evaluating models

How should we evaluate the predictive ability of a model? The method of ordinary least squares suggests that we should consider the sum of squared errors. A challenge with this is that the value of a sum grows in proportion to the number of observations, which makes it hard to compare results across data sources and subsets. A simple solution is to use the average value of the squared errors, known as the mean squared error or MSE:

mean((tea$score - tea$score_pred)^2, na.rm = TRUE)
## [1] 4.127367

This works as a general measurement of error, but the units are a bit strange, as they are given in squared scores. A simple fix is to take the square root of the MSE, resulting in the root mean squared error (RMSE):

sqrt(mean((tea$score - tea$score_pred)^2, na.rm = TRUE))
## [1] 2.031592

You’ll find references to the RMSE throughout the machine learning literature, as it is by far the most common measurement of predictive accuracy for continuous responses.

How good a result is our RMSE of 2.03? It’s hard to say for sure, but one easy thing to do is to compare it to the simplest possible model: the one that predicts every score will be equal to the average score. The RMSE of this estimator is given by:

sqrt(mean((tea$score - mean(tea$score, na.rm = TRUE))^2, na.rm = TRUE))
## [1] 2.104542

So, we have improved on the baseline model, but not by a large amount. Note: if you are familiar with the standard deviation, the RMSE of this mean-only model is almost exactly the standard deviation of the response.
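
If you want to check this yourself, the standard deviation of the response is easy to compute and should land very close to the baseline RMSE above; the two differ only in dividing by n - 1 rather than n:

sd(tea$score, na.rm = TRUE)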

Multivariate linear regression

The linear regression model I just introduced is known as simple linear regression because there is only one explanatory variable. We can easily consider multivariate models; for instance, we can write a two-variable model mathematically as follows:
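
$$ Y_i = \alpha + \beta \cdot X_i + \gamma \cdot Z_i + \epsilon_i, \qquad \text{mean}(\epsilon_i) = 0 $$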

The geometric interpretation of this is that we now have a plane in place of the line in the simple linear regression model.

When I teach an applied statistics course, we spend a lot of time working up to multivariate regression models. The interpretation of multivariate models can quickly become quite complex. However, using multivariate models in statistical learning is much easier to understand: each slope coefficient (beta and gamma here) corresponds to a weight placed on how much the response changes with each predictor variable.

Fitting multivariate models is also quite easy with the lm function. Simply join the variables you would like to use for prediction with a + in the model formula. Here we use both the number of reviews and the price of the tea:

model <- lm(score ~ num_reviews + price, data = tea)
tea$score_pred <- predict(model, newdata = tea)
model
## 
## Call:
## lm(formula = score ~ num_reviews + price, data = tea)
## 
## Coefficients:
## (Intercept)  num_reviews        price  
##   91.476718     0.001308     0.021595

Using the color aesthetics we learned last class, we can visualize the predicted values of this model. Don’t get hung up on the code below; concentrate on the output.

tea_grid <- expand.grid(seq(min(tea$num_reviews), max(tea$num_reviews), length.out = 20),
                        seq(min(tea$price), max(tea$price), length.out = 20))
tea_grid <- data_frame(num_reviews = tea_grid[,1],
                       price = tea_grid[,2])
tea_grid$score_pred <- predict(model, newdata = tea_grid)
qplot(price, num_reviews, data = tea_grid, color = score_pred,
      size = I(5)) +
  scale_color_viridis() +
  theme_minimal()


Also, the predictive performance of the model has improved over the simple linear regression from before:

sqrt(mean((tea$score - tea$score_pred)^2, na.rm = TRUE))
## [1] 1.9514
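
Since we keep retyping the same expression, one small convenience (a sketch, not part of the original notes) is to wrap the RMSE calculation in a helper function:

# root mean squared error, ignoring missing values
rmse <- function(y, y_pred) {
  sqrt(mean((y - y_pred)^2, na.rm = TRUE))
}

rmse(tea$score, tea$score_pred)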

Notes on Labs

  • you should see your completion marker increase from 0/28 to 2/28
  • you should also see that your first two labs now have scores in the file results.md
  • you may have entries in results.md but still be missing the completion scores
  • if there are any issues:
    • make sure that the files have correct extensions (Rmd, nb.html, and csv), including capitalization
    • if any files are currently misnamed, please rename them to the correct names
    • I will update the completion indicator again this evening; please try to have labs 1 and 2 fixed by then
  • due to naming issues, I’ve decided it will be easier to post lab assignments on the website rather than pushing the templates to GitHub
  • some versions of R and RStudio produce .html files rather than .nb.html files; this is fine - just put the .html files online instead. The grading script has been modified to catch this.
  • if you had error messages about “curl” when downloading files, you can run install.packages("devtools") once to get the packages it is complaining about