Project 02 – Instructions

Due Date: 22 February 2023

General Instructions

This page outlines the instructions for the second project. You should have a file project02.Rmd in your RStudio workspace where you can work on the project. I find that students prefer having a consistent format for the projects, so I will attempt to keep the format the same throughout the semester.

Your group is responsible for completing two elements:

Slides presenting your results, also submitted in your Box folder.
A presentation based on your slide show. This should be 8-10 minutes in length.

The slides must be uploaded by the first day of presentations regardless of when you present. As described on the syllabus, the project will be graded using a rubric, which can be found here.

Specific Instructions

This project uses a collection of product reviews from Amazon. I created it from the archive here, produced by Jianmo Ni, Jiacheng Li, Julian McAuley. I have selected a collection of reviews from the most prolific reviewers. The classification task for the project is to be able to predict the author of a review; there are 25 authors in each dataset. Each group has been assigned a different product category to work with (see assignments below).

The goal of your project is to apply the methods we have developed so far to explore your corpus and understand what features distinguish the authors. Remember that the goal is not just to classify; we want to use the methods to understand the style of the authors so it can be useful to build a variety of models (some of which may be less predictive) in order to achieve this. Here are some things you might consider looking into:

How well can you classify the authors based on word usage?
Are any of the most predictive words surprising? You can use KWiC to figure out what is going on.
How well can you classify the authors based on unigrams/bigrams/trigrams of upos and/or xpos tags? How does this compare to the word usage? Compare the predictiveness of a local model with a penalized regression.
Are some authors particularly hard to classify? Easy to classify? Can you tell why?
Do you think that the particular products that each person reviews is the main driver of you model? How could you remove this factor (and can you)?
Can you summarize the results in any interesting way?

These are just some ideas to get you started. Feel free to go in a different direction if you find something interesting! Keep in mind that confusion matrices will be useful and that even classification rates around 50% are much better than random guessing when there are many classes.

At the end of the day, though, your task is an open-ended one. I want you to explore the data using the techniques we have learned so far and then produce a short presentation showing your results to the rest of the class.

Presentation

Your presentation should be in the form of a set of slides. You can build these in whatever software you would like, but please create a PDF version to submit on Box by the project deadline.

The exact format of the slides is up to you, however there should be a final slide titled “Synthesis” that summarises what you have learned about the data from your analysis. It should take a big picture view of the analysis and not be overly focused on the models.

You will find that most (perhaps all) of the results you want to show are tables and example reviews. Do not use screen shots for these! Screen shots are messy and not ideal. Instead, I suggest using the function dsst_clipboard() to copy information to the clipboard. This can then be pasted into a spreadsheet program or (in some cases) directly into a presentation. For example:

dsst_neg_examples(model) %>% dsst_clipboard()

For visualisations, you can either use the function ggsave to store the most recent plot as a JPG or PNG file or right/ctrl click on a plot and save the file.

Groups

Each group is working with a different product category. You should be able to download your data set from within the project02.Rmd file.

We will have:

Group 1: video_games

Group 2: grocery

Group 3: toys_games

Group 4: movies_tv

Group 5: tools

Group 6: kindle

Group 7: cds

Group 8: pet_supplies

General Feedback From Project 1

To summarize my in-class feedback to everyone from Project 1, try to keep the following things in mind as you prepare Project 2:

Make sure the slides are readable and understandable out of the context of your talk. But also, don’t crowd them out with text. Try to find a good balance.
The slides should include your results and conclusions. A good way to do this is to show a larger part of a table, but then highlight the most important results. You could use several techniques, such as shapes, bold text, colors, or arrows.
Remember to copy tables into a spreadsheet program rather than using screenshots. Format the numbers in a way that makes them easily readable by only including a reasonable number of significant digits.
Remember to use a mixture of different types of results, with specific examples mixed in with larger summaries such as a confusion matrix.
This is not a discussion of your methods; we don’t need to hear what you did that was not helpful for your analysis.
For full marks, the next project requires you to do something that somehow goes beyond the straightforward application of techniques from class. I can help with this, but only if you are already far along when we do the project workshop in class.
The most important part of the presentation is your synthesis slide. This is not the same as a summary slide.

Notes

You’ll likely find with this assignment that you want an alternative way of looking at the coefficients table. One approach is to make the table a tibble by setting the to_tibble = TRUE option, and then manipulating the results as in the following code, which assumes you already have a model called model:

dsst_coef(model$model, lambda_num = 40, to_tibble = TRUE) %>%
  filter(term != "(Intercept)") %>%
  pivot_longer(names_to = "label", values_to = "coef", cols = -c(term, MLN)) %>%
  filter(coef != 0) %>%
  mutate(direction = if_else(sign(coef) > 0, "positive", "negative")) %>%
  group_by(label, direction) %>%
  summarize(term = paste(term, collapse = " | ")) %>%
  pivot_wider(
    id_cols = "label",
    values_from = "term",
    names_from = "direction",
    values_fill = ""
  )

## # A tibble: 5 × 3
## # Groups:   label [5]
##   label     negative                                                 posit…¹
##   <chr>     <chr>                                                    <chr>  
## 1 Austen    "man | hand | face"                                      sister…
## 2 Dickens   ""                                                       reply …
## 3 Doyle     ""                                                       uncle …
## 4 Stevenson ""                                                       sea | …
## 5 Wells     "good | dear | friend | should | can | will | own | you… shout …
## # … with abbreviated variable name ¹positive

It doesn’t look great in the website here, but if you do it in RStudio it will give a very nice view of the coefficents. You could further copy into a spreadsheet program to format.