Cereal Data

Today, we start by looking at a collection of breakfast cereals:

cereal <- read_csv("https://statsmaths.github.io/stat_data/cereal.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   brand = col_character(),
##   sugar = col_integer(),
##   score = col_double(),
##   shelf = col_character()
## )

With variables: name, brand, sugar, score, and shelf.

Produce a histogram of the sugar variable.

ggplot(cereal, aes(sugar)) +
  geom_histogram(color = "black", fill = "white", bins = 10)

Now, compute the standard deviation of the variable sugar:

sd(cereal$sugar)
## [1] 4.378656

What are the units of this measurement?

Answer: grams
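The standard deviation has the same units as the data because it is the square root of an average of squared deviations: grams are squared and then square-rooted back to grams. The sketch below computes it from the definition. The vector sugar_demo holds made-up values for illustration only and is not taken from cereal.csv.

```r
# Hypothetical sugar values in grams (illustrative; not from cereal.csv)
sugar_demo <- c(3, 6, 9, 12, 15)

# Sample standard deviation from its definition:
# square the deviations from the mean, sum them, divide by n - 1,
# then take the square root to return to the original units
n <- length(sugar_demo)
sd_manual <- sqrt(sum((sugar_demo - mean(sugar_demo))^2) / (n - 1))

# The built-in sd() uses the same n - 1 formula
sd_builtin <- sd(sugar_demo)

print(c(manual = sd_manual, builtin = sd_builtin))
```

Both values should agree, which is a quick way to check that sd() reports the sample (n - 1) standard deviation.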

Now, compute the deciles of the variable score:

deciles(cereal$score)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 18.0 28.0 31.0 34.5 37.0 40.0 42.0 48.0 53.0 58.0 84.0

What is the value of the 30th percentile? Describe what this means in words:

Answer: 34.5. It means that approximately 30% of cereals have an overall score less than 34.5 and approximately 70% have a score greater than 34.5.
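The deciles() function is not part of base R; it appears to be a course helper. Assuming it reports the 0%, 10%, ..., 100% quantiles, a similar table can be sketched with base R's quantile(). The vector score_demo is made up for illustration and is not taken from cereal.csv.

```r
# Hypothetical scores (illustrative; not from cereal.csv)
score_demo <- c(18, 25, 28, 31, 35, 37, 40, 42, 48, 53, 84)

# Quantiles at 0%, 10%, ..., 100% (assumed to match what deciles() reports)
dec <- quantile(score_demo, probs = (0:10) / 10)
print(dec)

# Pull out the 30th percentile by name
p30 <- dec[["30%"]]

# Share of observations strictly below the 30th percentile;
# with only a few values this is close to, but not exactly, 0.30
share_below <- mean(score_demo < p30)
print(share_below)
```

The share below the cut-off is only approximately 30% in a small sample, which is why the answer above says "approximately".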

Produce boxplots of score grouped by brand.

ggplot(cereal, aes(brand, score)) +
  geom_boxplot() +
  coord_flip()

Which brand seems to have the healthiest cereals?

Answer: Nabisco.
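A visual comparison of boxplots can be backed up by computing the median for each group, since the median is the center line of each box. The sketch below uses a small made-up data frame, brand_demo, with hypothetical brands A, B, and C; it is not the cereal data.

```r
# Hypothetical scores for three made-up brands (not from cereal.csv)
brand_demo <- data.frame(
  brand = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  score = c(30, 35, 40, 50, 55, 70, 20, 45, 48)
)

# Median score within each brand (the center line of each boxplot)
med <- tapply(brand_demo$score, brand_demo$brand, median)

# Brands ordered from highest to lowest median score
print(sort(med, decreasing = TRUE))

# Name of the brand with the highest median
top_brand <- names(which.max(med))
```

On the real data, the same call with cereal$score and cereal$brand would give the numbers behind the plot.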

Produce boxplots of score grouped by shelf.

ggplot(cereal, aes(shelf, score)) +
  geom_boxplot() +
  coord_flip()

Produce boxplots of sugar grouped by shelf.

ggplot(cereal, aes(shelf, sugar)) +
  geom_boxplot() +
  coord_flip()

If I want a healthy but reasonably sweet cereal, which shelf would be the best to look on?

Answer: The top and bottom shelves are similarly healthy, but the top has sweeter options, so the top shelf would be your best bet.

Tea Reviews

Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:

tea <- read_csv("https://statsmaths.github.io/stat_data/tea.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   type = col_character(),
##   score = col_integer(),
##   price = col_integer(),
##   num_reviews = col_integer()
## )

With variables:

- name: the full name of the tea
- type: the type of tea. One of:
    - black
    - chai
    - decaf
    - flavors
    - green
    - herbal
    - masters
    - matcha
    - oolong
    - pu_erh
    - rooibos
    - white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews

Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: geom_bestfit()).

ggplot(tea, aes(num_reviews, score)) +
  geom_point() +
  geom_bestfit()
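The geom_bestfit() layer is not part of ggplot2 itself; it appears to come from the course's helper package. Assuming it draws the ordinary least-squares line, the intercept and slope of that line can be sketched with base R's lm(). The data frame tea_demo is made up for illustration and is not taken from tea.csv.

```r
# Hypothetical data (illustrative; not from tea.csv)
tea_demo <- data.frame(
  num_reviews = c(10, 50, 120, 300, 800),
  score = c(82, 85, 88, 90, 93)
)

# Ordinary least-squares fit of score on num_reviews
fit <- lm(score ~ num_reviews, data = tea_demo)

# Intercept and slope of the fitted line
line_coefs <- coef(fit)
print(line_coefs)

# In plain ggplot2, geom_smooth(method = "lm", se = FALSE)
# draws a least-squares line of this kind on a scatterplot.
```

The slope gives the change in predicted score for each additional review, which is what the line on the scatterplot summarizes.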