library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

Tea Reviews

We are going to start by looking at a data set of tea reviews from the Adagio Tea website. I collected this data set a few years ago, so it should be similar, but not identical, to what is on the site today. Let’s read the data into R:

tea <- read_csv(file.path("data", "tea.csv"))
tea
## # A tibble: 238 x 5
##    score name                type  price num_reviews
##    <dbl> <chr>               <chr> <dbl>       <dbl>
##  1    96 irish_breakfast     black    10        3675
##  2    95 earl_grey_bravo     black    10        3520
##  3    95 golden_monkey       black    27        1125
##  4    96 black_dragon_pearls black    32        1748
##  5    95 yunnan_noir         black    17         988
##  6    95 earl_grey_moonlight black    10        2510
##  7    93 english_breakfast   black    17        1008
##  8    92 keemun_concerto     black    17         499
##  9    95 yunnan_gold         black    40        1094
## 10    94 ceylon_sonata       black    12        1525
## # … with 228 more rows

Looking at the data in the data viewer, we see several variables. The goal is to predict the user score of each tea.

The variables available to predict the score are the type of tea, the number of reviews received, and the price of the tea. The price is given in estimated cents per cup, as reported on the site. We also have the full name of the tea, though that will not be very useful for prediction.

Exploratory analysis

Univariate plots

Before doing anything else, we will do some exploratory data analysis:

tea %>%
  ggplot(aes(x = score)) +
    geom_bar()

The score values are generally very high, with most of them above 88. All of the scores are integers, and the most common values are between 92 and 95.

tea %>%
  ggplot(aes(x = price)) +
    geom_histogram(bins = 15, color = "black", fill = "white")

The price variable is heavily right-skewed, with a few very expensive teas. Most cost well under 25 cents per cup.

tea %>%
  ggplot(aes(x = num_reviews)) +
    geom_histogram(bins = 15, color = "black", fill = "white")

The number of reviews is also somewhat skewed, but not as strongly as the price variable.
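A quick numeric companion to these histograms is to compare the mean and the median of a variable: for a right-skewed variable the mean tends to sit above the median. The sketch below uses only the ten rows printed earlier, rebuilt by hand as `tea_head` (a name introduced here for illustration), so the numbers illustrate the idea and say nothing definitive about the full data set.

```r
library(dplyr)

# The ten rows shown in the printed tibble above, typed in by hand.
# `tea_head` is a name introduced for this sketch only.
tea_head <- tibble(
  score       = c(96, 95, 95, 96, 95, 95, 93, 92, 95, 94),
  price       = c(10, 10, 27, 32, 17, 10, 17, 17, 40, 12),
  num_reviews = c(3675, 3520, 1125, 1748, 988, 2510, 1008, 499, 1094, 1525)
)

# A mean above the median suggests a long right tail.
skew_check <- tea_head %>%
  summarise(
    price_mean     = mean(price),
    price_median   = median(price),
    reviews_mean   = mean(num_reviews),
    reviews_median = median(num_reviews)
  )

skew_check
```

Running the same `summarise()` call on the full `tea` table would give the corresponding check for all 238 teas.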

tea %>%
  group_by(type) %>%
  summarise(sm_count()) %>%
  arrange(desc(count)) %>%
  mutate(type = fct_inorder(type)) %>%
  ggplot(aes(x = type, y = count)) +
    geom_col()

There are twelve types of tea, with some having only a few samples and others having over thirty.
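The counting step above relies on `sm_count()` from the smodels package. For readers without that package, a roughly equivalent table can be sketched with `dplyr::count()`. The example assumes, as the plotting code above implies, that the count column should be named `count`. The `toy` table is made-up data for illustration only and does not come from `tea.csv`.

```r
library(dplyr)
library(forcats)

# Made-up toy data for illustration only; these are NOT rows from tea.csv.
toy <- tibble(type = c("black", "green", "black", "white", "green", "black"))

type_counts <- toy %>%
  count(type, sort = TRUE, name = "count") %>%  # one row per type, largest first
  mutate(type = fct_inorder(type))              # freeze that order as factor levels

type_counts
```

Because `fct_inorder()` sets the factor levels in order of appearance, a bar chart built from `type_counts` would show the bars from most to least common rather than alphabetically.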

Bivariate plots

Now we can proceed to bivariate plots showing the relationship between each predictor variable and the response.

tea %>%
  ggplot(aes(num_reviews, score)) +
    geom_point()