library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
We are going to start by looking at a data set of tea reviews, specifically reviews from the Adagio Tea website. I collected this data set a few years ago, so it should be similar to, but not exactly the same as, what is on the site today. Let’s read the data into R:
tea <- read_csv(file.path("data", "tea.csv"))
tea
## # A tibble: 238 x 5
##    score name                type  price num_reviews
##    <dbl> <chr>               <chr> <dbl>       <dbl>
##  1    96 irish_breakfast     black    10        3675
##  2    95 earl_grey_bravo     black    10        3520
##  3    95 golden_monkey       black    27        1125
##  4    96 black_dragon_pearls black    32        1748
##  5    95 yunnan_noir         black    17         988
##  6    95 earl_grey_moonlight black    10        2510
##  7    93 english_breakfast   black    17        1008
##  8    92 keemun_concerto     black    17         499
##  9    95 yunnan_gold         black    40        1094
## 10    94 ceylon_sonata       black    12        1525
## # … with 228 more rows
Looking at the data in the data viewer, we see several variables. The goal is to predict the user score of each tea.
The variables available for predicting the score are the type of tea, the number of reviews received, and the price of the tea. The price is given in estimated cents per cup, as reported on the site. We also have the full name of the tea, though that will not be very useful for prediction.
Before doing anything else, we will do some exploratory data analysis:
tea %>%
  ggplot(aes(x = score)) +
    geom_bar()
The score values are generally very high, with most of them above 88. All of the scores are integers, and the most common values are between 92 and 95.
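The claims about the scores can also be checked numerically. The sketch below is only an illustration: it uses the ten scores visible in the printed output above (all black teas from the top of the file), not the full 238-row data set, so the tallies will differ from those in the bar plot.

```r
# Scores from the first ten rows of the printed tibble (not the full data).
score <- c(96, 95, 95, 96, 95, 95, 93, 92, 95, 94)

# The column is stored as <dbl>, so confirm every value is a whole number.
all_whole <- all(score == round(score))

# Tally each distinct score; geom_bar() draws one bar per tally like this.
score_counts <- table(score)

print(all_whole)
print(score_counts)
```

On the full data, the same two lines applied to `tea$score` would show whether any non-integer scores exist and which values dominate.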
tea %>%
  ggplot(aes(x = price)) +
    geom_histogram(bins = 15, color = "black", fill = "white")
The price variable is heavily skewed, with a few very expensive teas. Most are well under a quarter per cup.
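A quick numeric companion to the histogram is to compare the mean and median and to compute the share of teas under 25 cents per cup. This is a minimal sketch on the ten prices shown in the printed output above, which are not a representative sample; the names `price_mean`, `price_median`, and `share_under_quarter` are just illustrative.

```r
# Prices (cents per cup) from the first ten rows of the printed tibble.
price <- c(10, 10, 27, 32, 17, 10, 17, 17, 40, 12)

# With a long right tail, the mean is pulled above the median.
price_mean <- mean(price)
price_median <- median(price)

# Proportion of teas costing less than a quarter (25 cents) per cup.
share_under_quarter <- mean(price < 25)

print(c(mean = price_mean, median = price_median))
print(share_under_quarter)
```

Running the same calculations on `tea$price` would quantify the skew for the whole data set.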
tea %>%
  ggplot(aes(x = num_reviews)) +
    geom_histogram(bins = 15, color = "black", fill = "white")
The number of reviews is also skewed, but not as strongly as the price variable.
tea %>%
  group_by(type) %>%
  summarise(sm_count()) %>%
  arrange(desc(count)) %>%
  mutate(type = fct_inorder(type)) %>%
  ggplot(aes(x = type, y = count)) +
    geom_col()
There are twelve types of tea, with some having only a few samples and others having over thirty.
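The per-type counts behind this plot can be reproduced with plain dplyr. The sketch below assumes that `sm_count()` produces a column named `count` holding the number of rows in each group, which is what the `arrange(desc(count))` call suggests; it is not taken from the smodels documentation. It uses only the `type` values from the ten printed rows, all of which are black teas, so the result has a single row.

```r
library(dplyr)

# Tea types from the first ten rows of the printed tibble (all "black").
tea_head <- data.frame(type = rep("black", 10))

# Assumed plain-dplyr equivalent of summarise(sm_count()):
# one row per type with the number of teas of that type.
type_counts <- tea_head %>%
  group_by(type) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

print(type_counts)
```

On the full data the result would have one row per tea type, sorted from most to least common, which is the order `fct_inorder()` then fixes for the bars.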
Now we can proceed to bivariate plots showing the relationship between each predictor variable and the response.
tea %>%
  ggplot(aes(num_reviews, score)) +
    geom_point()