## Cereal Data

Today, we start by looking at a collection of breakfast cereals:

``cereal <- read_csv("https://statsmaths.github.io/stat_data/cereal.csv")``
``````## Parsed with column specification:
## cols(
##   name = col_character(),
##   brand = col_character(),
##   sugar = col_integer(),
##   score = col_double(),
##   shelf = col_character()
## )``````

With variables:

• name: name of the specific cereal
• brand: name of the cereal's manufacturer
• sugar: amount of sugar per serving (g)
• score: healthiness score; 0-100; 100 is the best
• shelf: what shelf the cereal is typically stocked on in the store

Produce a histogram of the sugar variable.

``````ggplot(cereal, aes(sugar)) +
geom_histogram(color = "black", fill = "white", bins = 10)``````

Now, compute the standard deviation of the variable `sugar`:

``sd(cereal$sugar)``
``## [1] 4.378656``
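
As a rough sketch of what `sd()` computes, the same number can be built up step by step in base R. The `sugar` vector below is a small made-up example, not values taken from the cereal dataset.

```r
# Hypothetical sugar values in grams (made up for illustration, not from cereal.csv)
sugar <- c(2, 4, 4, 6, 9)

# Step 1: deviation of each value from the mean
deviations <- sugar - mean(sugar)

# Step 2: the sample variance divides the summed squared deviations by n - 1
variance <- sum(deviations^2) / (length(sugar) - 1)

# Step 3: the standard deviation is the square root of the variance
manual_sd <- sqrt(variance)

# Compare with the built-in function; the two should agree
print(manual_sd)
print(sd(sugar))
```

Following the steps by hand shows what happens to the units at each stage: the variance is in squared units, and the final square root undoes the squaring.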

What are the units of this measurement?

Now, compute the deciles of the variable `score`:

``deciles(cereal$score)``
``````##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
## 18.0 28.0 31.0 34.5 37.0 40.0 42.0 48.0 53.0 58.0 84.0``````

What is the value of the 30th percentile? Describe what this means in words:

Answer: 34.5. It means that approximately 30% of cereals have an overall score less than 34.5 and 70% have a score greater than 34.5.
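
If `deciles()` is not available (it is assumed here to be a helper loaded for this course, not a base R function), a similar table can be sketched with base R's `quantile()`. The `scores` vector is made up for illustration, so its deciles will not match the cereal output above.

```r
# Hypothetical healthiness scores (made up for illustration, not from cereal.csv)
scores <- c(18, 28, 31, 35, 37, 40, 42, 48, 53, 58, 84)

# Deciles are the quantiles at 0%, 10%, ..., 100%
decile_values <- quantile(scores, probs = seq(0, 1, by = 0.1))
print(decile_values)

# Pull out the 30th percentile by its label
p30 <- decile_values[["30%"]]

# Check the interpretation: the share of scores strictly below the 30th percentile
share_below <- mean(scores < p30)
print(share_below)
```

With only a handful of values the share below the 30th percentile is close to, but not exactly, 30%, which is why the answer above says "approximately".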

Produce a boxplot of score and brand.

``````ggplot(cereal, aes(brand, score)) +
geom_boxplot() +
coord_flip()``````

Which brand seems to have the healthiest cereals?

Produce a boxplot of score and shelf.

``````ggplot(cereal, aes(shelf, score)) +
geom_boxplot() +
coord_flip()``````

Produce a boxplot of sugar and shelf.

``````ggplot(cereal, aes(shelf, sugar)) +
geom_boxplot() +
coord_flip()``````

If I want a healthy but reasonably sweet cereal, which shelf would be the best to look on?

Answer: The top and bottom shelves are similarly healthy, but the top has sweeter options, so the top shelf would be your best bet.
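
The comparison read off the two boxplots can also be checked numerically. Here is a minimal sketch using base R's `aggregate()` to compute the median of each variable per shelf. The data frame is a tiny made-up stand-in for `cereal`, and the shelf labels are assumptions chosen for illustration.

```r
# Hypothetical stand-in for the cereal data (values and shelf labels are made up)
cereal_small <- data.frame(
  shelf = c("top", "top", "middle", "middle", "bottom", "bottom"),
  score = c(50, 40, 30, 34, 44, 46),
  sugar = c(6, 8, 12, 10, 2, 4)
)

# Median score and median sugar for each shelf, one row per shelf;
# the median is the thick center line drawn in each boxplot
shelf_medians <- aggregate(cbind(score, sugar) ~ shelf, data = cereal_small, FUN = median)
print(shelf_medians)
```

A table like this gives only the center of each box. The boxplots are still worth looking at because they also show the spread and any outliers.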

## Tea Reviews

Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:

``tea <- read_csv("https://statsmaths.github.io/stat_data/tea.csv")``
``````## Parsed with column specification:
## cols(
##   name = col_character(),
##   type = col_character(),
##   score = col_integer(),
##   price = col_integer(),
##   num_reviews = col_integer()
## )``````

With variables:

• name: the full name of the tea
• type: the type of tea; one of black, chai, decaf, flavors, green, herbal, masters, matcha, oolong, pu_erh, rooibos, white
• score: user rated score; from 0 to 100
• price: estimated price of one cup of tea
• num_reviews: total number of online reviews

Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: `geom_bestfit()`).

``````ggplot(tea, aes(num_reviews, score)) +
geom_point() +
geom_bestfit()``````
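
To see the numbers behind a regression line, the line can be fit directly with base R's `lm()`. This is a sketch under the assumption that `geom_bestfit()` (taken here to be a course-provided helper) draws an ordinary least-squares line. The `tea_small` data frame is made up for illustration.

```r
# Hypothetical stand-in for the tea data (values are made up)
tea_small <- data.frame(
  num_reviews = c(10, 20, 30, 40, 50),
  score = c(80, 83, 83, 86, 88)
)

# Fit score as a linear function of num_reviews by least squares
fit <- lm(score ~ num_reviews, data = tea_small)

# The two coefficients are the intercept and the slope of the line
line_coefs <- coef(fit)
print(line_coefs)

# Predicted score for a tea with 25 reviews, read off the fitted line
predicted <- predict(fit, newdata = data.frame(num_reviews = 25))
print(predicted)
```

The slope says how much the score changes, on average, for each additional review. In plain ggplot2, `geom_smooth(method = "lm", se = FALSE)` draws a least-squares line of this kind on a scatterplot.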