Due Date: 23 March 2021
This page outlines the instructions for the first project. You should have a file project02.Rmd
in your RStudio workspace where you can work on the project. I find that students prefer having a consistent format for the projects, so I will attempt to keep the format the same throughout the semester.
On the due-date, your group is responsible for completing three elements:
As described on the syllabus, the project will be graded as either Satisfactory or Unsatisfactory. I will provide additional feedback that you can address in the next project.
This project uses a collection of product reviews from Amazon. I created it from the archive here, produced by Jianmo Ni, Jiacheng Li, Julian McAuley. I have selected a collection of reviews from the most prolific reviewers. The classification task for the project is to be able to predict the author of a review. Each group has been assigned a different product category to work with (see assignments below).
The goal of your project is to apply the methods we have developed so far to explore your corpus and understand what features distinguish the authors. Remember that the goal is not just to classify; we want to use the methods to understand the style of the authors so it can be useful to build a variety of models (some of which may be less predictive) in order to achieve this. Here are some things you should consider looking into:
upos
and/or xpos
tags? How does this compare to the word usage? Compare the predictiveness of a local model with a penalized regression.These are just some ideas to get you started. Feel free to go in a different direction if you find something interesting! Keep in mind that confusion matrices will be useful and that even classification rates around 30% are much better than random guessing when there are many classes.
Each group is working with a different product category. You should be able to download your data set from within the project02.Rmd
file.
Group 1: grocery
Group 2: movies_tv
Group 3: kindle
Group 4: pet_supplies
Group 5: video_games
Group 6: toys_games
Group 7: tools
Group 8: cds
While working through the project, I typically find that many groups ask for help writing the same bits of code. Any notes that I want to share about how to do specific tasks will be added here as we work through the project.