Due Date: 06 April 2021 08 April 2021

General Instructions

This page outlines the instructions for the first project. You should have a file project03.Rmd in your RStudio workspace where you can work on the project. I find that students prefer having a consistent format for the projects, so I will attempt to keep the format the same throughout the semester.

On the due-date, your group is responsible for completing three elements:

  1. A short (about one-page single spaced) description of your work answering a subset of the questions below. Please submit as a Google Doc in your shared group Google Drive folder.
  2. A Google Slides presentation of your results, also submitted in your shared drive folder.
  3. Giving a presentation based on your slide show. This should be 8-10 minutes in length. I suggest having one member drive the slides and the others rotate describing the results.

As described on the syllabus, the project will be graded as either Satisfactory or Unsatisfactory. I will provide additional feedback that you can address in the next project.

Specific Instructions

The data for this project is similar to the previous one, but this time reviews come from Yelp. I created the data set based on what is provided by the Yelp Open Dataset project. As with before, I have selected reviewers with a large number of reviews. However, this time you have a number of different variables to work with. Variables that are available to focus on (either as a prediction or as an unsupervised grouping variable) are:

  1. author gender (binary categories; automatically guessed using first names; yes there are a lot of problems with this, but still interesting to look at)
  2. author name (100 authors in each dataset)
  3. stars (1-5)
  4. business categories (as in the notes)
  5. business names

As with the first two projects, you are encouraged to take the data in whatever direction you find most interesting. The only thing I require is that you focus a non-trivial amount of your work on integrating the unsupervised approaches covered in Notebook13 into your analysis. Note that gender has too few categories to use unsupervised on its own and business names likely have too many to do supervised learning.

If you need some questions to get started, I am happy to provide some of these. However, there are so many directions to go in, I only want to do this if you are struggling with an idea.


Each group is working with a different city’s data. You should be able to download your data set from within the project03.Rmd file.

Group 1: charlotte

Group 2: calgary

Group 3: pittsburgh

Group 4: phoenix

Group 5: las-vegas

Group 6: madison

Group 7: cleveland

Group 8: toronto


While working through the project, I typically find that many groups ask for help writing the same bits of code. Any notes that I want to share about how to do specific tasks will be added here as we work through the project.