Due Date: 22 February 2023
This page outlines the instructions for the second project. You
should have a file project02.Rmd
in your RStudio workspace
where you can work on the project. I find that students prefer having a
consistent format for the projects, so I will attempt to keep the format
the same throughout the semester.
Your group is responsible for completing two elements:
The slides must be uploaded by the first day of presentations regardless of when you present. As described on the syllabus, the project will be graded using a rubric, which can be found here.
This project uses a collection of product reviews from Amazon. I created it from the archive here, produced by Jianmo Ni, Jiacheng Li, Julian McAuley. I have selected a collection of reviews from the most prolific reviewers. The classification task for the project is to be able to predict the author of a review; there are 25 authors in each dataset. Each group has been assigned a different product category to work with (see assignments below).
The goal of your project is to apply the methods we have developed so far to explore your corpus and understand what features distinguish the authors. Remember that the goal is not just to classify; we want to use the methods to understand the style of the authors so it can be useful to build a variety of models (some of which may be less predictive) in order to achieve this. Here are some things you might consider looking into:
upos
and/or xpos
tags? How does this compare to the word usage? Compare the
predictiveness of a local model with a penalized regression.These are just some ideas to get you started. Feel free to go in a different direction if you find something interesting! Keep in mind that confusion matrices will be useful and that even classification rates around 50% are much better than random guessing when there are many classes.
At the end of the day, though, your task is an open-ended one. I want you to explore the data using the techniques we have learned so far and then produce a short presentation showing your results to the rest of the class.
Your presentation should be in the form of a set of slides. You can build these in whatever software you would like, but please create a PDF version to submit on Box by the project deadline.
The exact format of the slides is up to you, however there should be a final slide titled “Synthesis” that summarises what you have learned about the data from your analysis. It should take a big picture view of the analysis and not be overly focused on the models.
You will find that most (perhaps all) of the results you want to show
are tables and example reviews. Do not use screen shots for these!
Screen shots are messy and not ideal. Instead, I suggest using the
function dsst_clipboard()
to copy information to the
clipboard. This can then be pasted into a spreadsheet program or (in
some cases) directly into a presentation. For example:
dsst_neg_examples(model) %>% dsst_clipboard()
For visualisations, you can either use the function
ggsave
to store the most recent plot as a JPG or PNG file
or right/ctrl click on a plot and save the file.
Each group is working with a different product category. You should
be able to download your data set from within the
project02.Rmd
file.
We will have:
Group 1: video_games
Group 2: grocery
Group 3: toys_games
Group 4: movies_tv
Group 5: tools
Group 6: kindle
Group 7: cds
Group 8: pet_supplies
To summarize my in-class feedback to everyone from Project 1, try to keep the following things in mind as you prepare Project 2:
You’ll likely find with this assignment that you want an alternative
way of looking at the coefficients table. One approach is to make the
table a tibble by setting the to_tibble = TRUE
option, and
then manipulating the results as in the following code, which assumes
you already have a model called model
:
dsst_coef(model$model, lambda_num = 40, to_tibble = TRUE) %>%
filter(term != "(Intercept)") %>%
pivot_longer(names_to = "label", values_to = "coef", cols = -c(term, MLN)) %>%
filter(coef != 0) %>%
mutate(direction = if_else(sign(coef) > 0, "positive", "negative")) %>%
group_by(label, direction) %>%
summarize(term = paste(term, collapse = " | ")) %>%
pivot_wider(
id_cols = "label",
values_from = "term",
names_from = "direction",
values_fill = ""
)
## # A tibble: 5 × 3
## # Groups: label [5]
## label negative posit…¹
## <chr> <chr> <chr>
## 1 Austen "man | hand | face" sister…
## 2 Dickens "" reply …
## 3 Doyle "" uncle …
## 4 Stevenson "" sea | …
## 5 Wells "good | dear | friend | should | can | will | own | you… shout …
## # … with abbreviated variable name ¹positive
It doesn’t look great in the website here, but if you do it in RStudio it will give a very nice view of the coefficents. You could further copy into a spreadsheet program to format.