Project 1

Due Date: Noon, 01 October 2020

Total Points 60

This page outlines the instructions for the first project. You should have a file project01.Rmd in your RStudio Cloud workspace where you can work on the project. You should also have been invited to a Richmond Box folder; please upload your work to the “project01” directory within this shared folder. Finally, note that the project has a presentation component, which will be given in class on 1 October.


For this project you will be creating a small dataset, producing a single annotated visualization, and telling a written and oral story about the plot. The goal is to illustrate command of the core notebooks (1-8) describing the tools we will be working with during the rest of the semester. The story should help us get to know you a bit better; it does not have to be particularly surprising or insightful regarding the dataset, which will be quite small compared to our other projects.

You have three options for constructing a dataset; select whichever one is the most interesting to you. Regardless of the option chosen, you should have a dataset with at least 20 observations and four variables. Keep in mind the notes from notebook07 as you structure the dataset.

Television: Create a dataset of your favorite television show, with one observation per episode. Try to pick a show that has an easy-to-find Wikipedia page for individual episodes. Include as many seasons as needed to get at least 20 rows (feel free to include more). Record the following information:

  • episode title
  • the season number
  • the episode number (overall or within the season)
  • original air date
  • number of viewers
  • a personal rating of how much you enjoy each episode

Sports: Create a dataset where each row corresponds to one game played by a particular sports team. To get the minimum number of data points, this could be one season of the NFL (with pre-season), two seasons of college football, a long play-off run in the NBA or NHL, or a similar length of time for your favorite sport. Record the following information:

  • opposing team
  • game number (if doing a season); season (if multiple seasons); series (if doing an NBA/NHL playoff run); or similar metric
  • date of the game
  • score differential (your team has a positive number if they win; negative otherwise)
  • attendance, viewership numbers, or other metric of interest; if this is not possible (though it should be for most U.S. sports) create your own metric of how exciting the game was
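
As a rough illustration of the sign convention for the score differential, the sketch below subtracts the opponent's score from your team's score. The column names (`team_score`, `opp_score`, `score_diff`) and the three rows are made up for the example; you can also simply type the differential into your spreadsheet by hand.

```r
library(tibble)
library(dplyr)

# Hypothetical game results; opponents and scores are invented for illustration.
games <- tibble(
  opponent   = c("Team A", "Team B", "Team C"),
  team_score = c(24, 17, 31),
  opp_score  = c(21, 20, 31)
)

# Positive when your team wins, negative when it loses, zero for a tie.
games <- games %>%
  mutate(score_diff = team_score - opp_score)

games
```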

Recipes: Create a dataset of recipes. I suggest grabbing them from a website such as, but another site could work as well. Record the following information:

  • name of the dish
  • a category for each dish (side/main/dessert, type of food, or whatever else makes sense for your data)
  • total number of reviews
  • total number of ratings
  • overall rating (number of stars or other metric)

Feel free to add any additional variables that you think will help you tell a story with the dataset.

Once you have the dataset constructed, export it as a csv file and read it into R. Spend some time producing an interesting visualization, with a particular focus on telling a story about yourself. This could be your favorite TV episode(s), something you remember from a specific sports game, or a particular memory of cooking a food item. Make sure to add useful labels and follow the guidelines in notebook08. Save the graphic and add some manual annotations. Then, in your favorite text editor, write a short description (around 250 words) that describes the data and tells a story about the plot. Finally, construct a data dictionary as an additional table and export this as a csv file as well.
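
The reading and exporting steps might look roughly like the sketch below, which uses the readr package. Everything specific here is an assumption made for illustration: the file names, the two made-up recipe rows (written to a temporary folder so the example runs on its own), and the three-column layout of the data dictionary. Follow the course notebooks where they differ.

```r
library(readr)
library(tibble)

# Hypothetical file names; a temporary folder keeps the example self-contained.
data_path <- file.path(tempdir(), "project01_data.csv")
dict_path <- file.path(tempdir(), "project01_dictionary.csv")

# Stand-in for the csv you would export from a spreadsheet (two invented rows).
write_csv(
  tibble(
    dish        = c("Dish A", "Dish B"),
    category    = c("main", "dessert"),
    num_reviews = c(120, 45),
    rating      = c(4.5, 3.8)
  ),
  data_path
)

# Read the dataset into R.
recipes <- read_csv(data_path)

# Check the size requirement: at least 20 observations and four variables.
# (This toy dataset has only two rows, so the check returns FALSE.)
meets_size <- nrow(recipes) >= 20 && ncol(recipes) >= 4

# A data dictionary with one row per variable in the dataset.
dictionary <- tibble(
  variable    = names(recipes),
  type        = c("character", "character", "numeric", "numeric"),
  description = c(
    "name of the dish",
    "category of the dish (side/main/dessert)",
    "total number of reviews",
    "overall rating, in stars"
  )
)

# Export the data dictionary as its own csv file.
write_csv(dictionary, dict_path)
```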

When you are finished, export the essay as a pdf, and upload it to Box along with the annotated graphic (should be one of: pdf, png, jpg), the csv file of your data, and the csv file of your data dictionary. Be prepared to give a three-minute presentation about your plot in class on the project’s due date. You do not have to upload your R script.


The project will be graded out of 60 points, according to the following rubric:

  • 5 points all files uploaded correctly and in the correct format
  • 10 points the constructed dataset follows the specifications above
  • 5 points the constructed dataset follows the guidelines in Notebook07
  • 5 points the data dictionary is clear and complete
  • 20 points the data visualization follows the guidelines in Notebook08 for telling a story with data; specifically, it uses a reasonable plot-type, avoids clutter, focuses attention, includes labels and annotations, and makes good use of color and (when needed) scales
  • 10 points the data-driven story is free of most typos, clearly relates to the visualization, and illustrates thoughtful consideration of the data and what it is able to show
  • 5 points oral presentation has been practiced, tells an interesting story, and lasts approximately three minutes

You will receive a grade for your work through the shared Box folder. A current participation grade will also be included.

Coding Notes

We will be going more in-depth about ways of working with date and date-time data in the coming weeks. Some of you may, however, need to work with date objects in the plot for this project. Let’s say, for example, that you have recorded the day, month, and year of some events:

## # A tibble: 6 x 4
##   class_day class_month class_year topic   
##       <dbl>       <dbl>      <dbl> <chr>   
## 1        27           8       2020 Intro   
## 2         1           9       2020 Graphics
## 3         3           9       2020 Graphics
## 4         8           9       2020 Database
## 5        10           9       2020 Database
## 6        15           9       2020 Database
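
If you want to follow along, a table like the one above can be typed directly into R. The sketch below is one way to do that with the tibble package; it reproduces the values shown, but it is not necessarily how the table was originally created.

```r
library(tibble)

# Rebuild the six rows shown above; numbers are stored as doubles (<dbl>).
data_science <- tibble(
  class_day   = c(27, 1, 3, 8, 10, 15),
  class_month = c(8, 9, 9, 9, 9, 9),
  class_year  = rep(2020, 6),
  topic       = c("Intro", "Graphics", "Graphics",
                  "Database", "Database", "Database")
)

data_science
```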

By loading the lubridate package, we can use the mutate function to create a new column that stores the full date as its own variable, like this:


library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides make_date()
data_science <- data_science %>%
  mutate(class_date = make_date(class_year, class_month, class_day))
## # A tibble: 6 x 5
##   class_day class_month class_year topic    class_date
##       <dbl>       <dbl>      <dbl> <chr>    <date>    
## 1        27           8       2020 Intro    2020-08-27
## 2         1           9       2020 Graphics 2020-09-01
## 3         3           9       2020 Graphics 2020-09-03
## 4         8           9       2020 Database 2020-09-08
## 5        10           9       2020 Database 2020-09-10
## 6        15           9       2020 Database 2020-09-15
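
If you recorded each date as a single piece of text rather than as separate day, month, and year columns, lubridate's parsing helpers are one possible alternative. The sketch below uses made-up strings; `mdy()` expects month-day-year order and `ymd()` expects year-month-day order. Parsing month names assumes an English locale.

```r
library(lubridate)

# Made-up example strings in two common layouts.
long_text <- c("September 1, 2020", "September 3, 2020")
iso_text  <- c("2020-09-08", "2020-09-10")

long_dates <- mdy(long_text)  # month, day, year
iso_dates  <- ymd(iso_text)   # year, month, day

class(long_dates)
```

Either way, the result is a date column just like class_date above.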

This column can be used within plots, like this:

data_science %>%
  ggplot(aes(class_date, topic)) +
    geom_point() +