Due Date: 23 March 2021

General Instructions

This page outlines the instructions for the project. You should have a file project02.Rmd in your RStudio workspace where you can work on it. I find that students prefer a consistent format for the projects, so I will try to keep the format the same throughout the semester.

On the due date, your group is responsible for completing three elements:

  1. A short description (about one page, single-spaced) of your work answering a subset of the questions below. Please submit it as a Google Doc in your shared group Google Drive folder.
  2. A Google Slides presentation of your results, also submitted in your shared drive folder.
  3. A presentation based on your slides, 8-10 minutes in length. I suggest having one member drive the slides while the others take turns describing the results.

As described on the syllabus, the project will be graded as either Satisfactory or Unsatisfactory. I will provide additional feedback that you can address in the next project.

Specific Instructions

This project uses a collection of product reviews from Amazon. I created it from the archive produced by Jianmo Ni, Jiacheng Li, and Julian McAuley, selecting reviews written by the most prolific reviewers. The classification task for the project is to predict the author of a review. Each group has been assigned a different product category to work with (see assignments below).

The goal of your project is to apply the methods we have developed so far to explore your corpus and understand what features distinguish the authors. Remember that the goal is not just to classify: we want to use the methods to understand the style of the authors. To that end, it can be useful to build a variety of models, some of which may be less predictive than others. Here are some things you should consider looking into:

These are just some ideas to get you started. Feel free to go in a different direction if you find something interesting! Keep in mind that confusion matrices will be useful and that even classification rates around 30% are much better than random guessing when there are many classes.
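
To make the point about confusion matrices and chance rates concrete, here is a minimal sketch. It is written in Python purely as an illustration (the project itself is done in R), the labels are toy values rather than anything from the Amazon data, and the helper names `confusion_matrix`, `accuracy`, and `chance_rate` are hypothetical. The chance rate assumes every author is equally likely under random guessing.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true authors, columns are predictions."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(true_labels, predicted_labels):
    """Fraction of reviews whose predicted author matches the true author."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def chance_rate(true_labels):
    """Expected accuracy of uniform random guessing (assumes balanced classes)."""
    return 1 / len(set(true_labels))


# Toy labels (hypothetical): five authors with two reviews each.
true_authors = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
predicted = ["a", "b", "b", "b", "a", "c", "e", "a", "e", "b"]

labels, matrix = confusion_matrix(true_authors, predicted)
model_accuracy = accuracy(true_authors, predicted)
baseline = chance_rate(true_authors)

print("rows = true author, columns = predicted author:", labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", model_accuracy)
print("chance rate:", baseline)
```

In this toy example, half of the predictions are correct, while uniform guessing among five authors would be right only one time in five. The off-diagonal counts show which authors are mistaken for one another, which is often more informative about style than the overall rate.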


Each group is working with a different product category. You should be able to download your data set from within the project02.Rmd file.

Group 1: grocery

Group 2: movies_tv

Group 3: kindle

Group 4: pet_supplies

Group 5: video_games

Group 6: toys_games

Group 7: tools

Group 8: cds


While working through the project, I typically find that many groups ask for help writing the same bits of code. Any notes that I want to share about how to do specific tasks will be added here as we work through the project.