Due Date: 01 February 2023
This page outlines the instructions for the first project. You should have a file project01.Rmd in your RStudio workspace where you can work on the project. I find that students prefer having a relatively consistent format for the projects, so I will attempt to keep the general format the same throughout the semester.
On the due date, you or your group is responsible for completing two elements:
As described on the syllabus, the project will be graded using a rubric, which can be found here.
This project uses a collection of book reviews from the website Goodreads. The data have been taken from the following paper:
Mengting Wan, Julian McAuley, “Item Recommendation on Monotonic Behavior Chains”, in RecSys’18.
Groups will be assigned different genres to look at, so you will have a slightly different analysis compared to the rest of the class. You can read more about the collection as a whole here.
There are two different prediction tasks that you may look at. The first is to classify the score of the book on a scale from 1 to 5. The second is to predict the gender of the author of the book being described. Here are some things you might consider addressing:
At the end of the day, though, your task is an open-ended one. I want you to explore the data using the techniques we have learned so far and then produce a short presentation showing your results to the rest of the class.
Your presentation should be in the form of a set of slides. You can build these in whatever software you would like, but please create a PDF version to submit on Box by the project deadline. I will not accept projects in other file formats.
The exact format of the slides is up to you; however, there should be a final slide titled “Synthesis” that summarises what you have learned about the data from your analysis. It should take a big-picture view of the analysis and not be overly focused on the models.
You will find that most (perhaps all) of the results you want to show are tables and example reviews. Do not use screenshots for these! Screenshots are messy and not ideal. Instead, I suggest using the function dsst_clipboard() to copy information to the clipboard. This can then be pasted into a spreadsheet program or (in some cases) directly into a presentation. For example:
dsst_neg_examples(model) %>% dsst_clipboard()
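If copying to the clipboard does not work in your setup, one possible fallback is to write the table to a CSV file and open that in a spreadsheet program. The sketch below is only an illustration using base R; it is not one of the course's dsst functions, and the small data frame and its column names are hypothetical stand-ins for whatever table you want to export.

```r
# Hypothetical table standing in for output such as dsst_neg_examples(model);
# the column names here are made up for illustration.
examples <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("A lovely picture book.", "Not what I expected."),
  stringsAsFactors = FALSE
)

# Write to a temporary location; swap in a path inside your project folder.
out_csv <- file.path(tempdir(), "examples.csv")
write.csv(examples, out_csv, row.names = FALSE)
```

The resulting file can be opened in a spreadsheet program and the cells copied into your slides.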
For visualisations, you should use the function ggsave() to store the most recent plot as a JPG or PNG file.
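As a minimal sketch of saving a plot to a PNG file, the example below builds a small bar chart and writes it out. The data frame, its values, and the file name are hypothetical placeholders; in your project the plot would come from your own analysis.

```r
library(ggplot2)

# Toy data standing in for a project result (hypothetical values).
scores <- data.frame(score = 1:5, n = c(10, 20, 35, 60, 75))

# Build the plot and keep a reference to it.
p <- ggplot(scores, aes(x = score, y = n)) +
  geom_col()

# By default ggsave() saves the most recent plot; passing `plot = p`
# makes the choice explicit. The file extension selects the format.
# Swap in a path inside your project folder instead of tempdir().
out_file <- file.path(tempdir(), "score_counts.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

Setting the width and height explicitly (in inches by default) keeps the figure proportions consistent across slides.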
Groups are assigned different genres to look at in their project. You should be able to download your data set from within the project01.Rmd file.
Group 1: children_b
Group 2: comics_graphic_b
Group 3: fantasy_paranormal_b
Group 4: history_biography_b
Group 5: mystery_thriller_crime_b
Group 6: poetry_b
Group 7: romance_b
Group 8: young_adult_b
While working through the project, I typically find that many groups ask for help writing the same bits of code. Any notes that I want to share about how to do specific tasks will be added here as we work through the project.