Due Date: 17 April 2023
This page outlines the instructions for the project. You should have a file project04.Rmd in your RStudio workspace where you can work on the project. I find that students prefer a consistent format for the projects, so I will attempt to keep the format the same throughout the semester.
For this project, I expect everyone to work on their own. You are responsible for completing two elements:
The slides must be uploaded by the first day of presentations, regardless of when you present. As described in the syllabus, the project will be graded using a rubric, which can be found here.
For this project, part of the assignment is constructing your data set from Wikipedia. This can be done with the code in project04-create.Rmd. Following the method used in Notebook11, you will start with a small set of pages and create a corpus of pages by following the links from those starting pages.
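The idea of growing a corpus by following links can be illustrated with a small sketch. This is not the code from project04-create.Rmd or Notebook11; it is a minimal Python illustration under stated assumptions. The link graph TOY_LINKS, the functions build_corpus and toy_get_links, and the parameters max_depth and max_pages are all hypothetical names, and the toy graph stands in for real Wikipedia pages so that nothing is downloaded.

```python
from collections import deque

# Hypothetical toy link graph standing in for Wikipedia: page title -> linked titles.
TOY_LINKS = {
    "List of fantasy novels": ["The Hobbit", "A Wizard of Earthsea"],
    "The Hobbit": ["J. R. R. Tolkien", "A Wizard of Earthsea"],
    "A Wizard of Earthsea": ["Ursula K. Le Guin"],
    "J. R. R. Tolkien": ["The Hobbit"],
    "Ursula K. Le Guin": [],
}


def toy_get_links(title):
    """Look up outgoing links in the toy graph (assumption: no network access)."""
    return TOY_LINKS.get(title, [])


def build_corpus(start_pages, get_links, max_depth=1, max_pages=1500):
    """Collect page titles reachable from start_pages within max_depth link hops.

    Pages are visited breadth-first, so pages closer to the starting
    pages are added before pages that are further away.
    """
    pages = list(dict.fromkeys(start_pages))  # keep order, drop duplicates
    seen = set(pages)
    queue = deque((title, 0) for title in pages)
    while queue:
        title, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in get_links(title):
            if len(pages) >= max_pages:
                return pages  # stop once the corpus is large enough
            if linked not in seen:
                seen.add(linked)
                pages.append(linked)
                queue.append((linked, depth + 1))
    return pages


corpus_depth1 = build_corpus(["List of fantasy novels"], toy_get_links, max_depth=1)
corpus_depth2 = build_corpus(["List of fantasy novels"], toy_get_links, max_depth=2)
print(corpus_depth1)
print(corpus_depth2)
```

In this sketch, raising max_depth from 1 to 2 pulls in the author pages as well as the novels. The same effect on a real corpus is a reason to inspect the initial results before deciding how far to follow links.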
The set of documents you create is completely up to you. You may, for example, start with one option and then modify it based on the initial results. I will come around and make sure everyone is on the right track. If you need a suggestion, consider starting from “list” pages, such as Early modern universities, Sovereign States, U.S. Cities, or Fantasy Novels.
You can even work with a language other than English, but please check with me as I can offer guidance on what other languages are complete enough (and have good-enough parsers) to work for the project. Ultimately, you should aim for somewhere in the range of 100-1500 pages.
This project does not have a predictive modeling task associated with it. The goal is simply to explain your corpus to the rest of the class using the techniques we have learned, such as topic models, document clustering, TF-IDF, and KWiC. Consider including interesting visualizations in addition to tables, and keep in mind that it is better to give a focused and interesting presentation than a boring and encyclopedic one.
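As a reminder of what a TF-IDF summary looks like, here is a minimal Python sketch on a made-up three-document corpus. It assumes the common weighting of raw term count times log(N / document frequency); the weighting used in the course notebooks may differ. The names docs, tf_idf, and top_terms are hypothetical, as are the document texts.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; real documents would be the text of the collected pages.
docs = {
    "The Hobbit": "dragon gold dwarves dragon mountain",
    "A Wizard of Earthsea": "wizard shadow island wizard sea",
    "Dune": "desert spice worms desert gold",
}


def tf_idf(docs):
    """Return {doc: {term: score}} with score = count * log(N / document frequency)."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        name: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for name, c in counts.items()
    }


def top_terms(scores, k=2):
    """Pick the k highest-scoring terms per document (ties broken alphabetically)."""
    return {name: sorted(s, key=lambda t: (-s[t], t))[:k] for name, s in scores.items()}


scores = tf_idf(docs)
top = top_terms(scores, k=1)
print(top)
```

A term such as "gold", which appears in two of the three documents, scores lower than a term unique to one document. That is why the top terms tend to describe what sets each page apart from the rest of the corpus.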
While working through the project, I typically find that many students ask for help writing the same bits of code. Any notes that I want to share about how to do specific tasks will be added here as we work through the project.