Project IV: Statistical Programming Tutorial

Due: 2018-12-04 (start of class)

Starter code: project-iv.Rmd, project-iv-lab.Rmd

Rubric: project-iv-rubric.csv

For this project you are going to select a topic in statistical programming or data analysis that we have not covered. You will create a set of class tutorial notes and a lab similar to the ones we have used this semester. During the last week of the semester, you will give a brief presentation of the tutorial. I will make all of these available on the class website so that everyone can read and benefit from them.

I am open to other options, and we will discuss further possibilities in class, but here are some ideas that would make good topics:

  • using the glm function to fit a logistic regression
  • using the glm function to fit a Poisson regression
  • how to visualize network/graph data (igraph)
  • using the melt function (see reshape)
  • using the gather function (see tidyr)
  • loading, writing, and basic manipulation of image data
  • running a topic model over a corpus of texts
  • running and tuning a random forest model (randomForest)
  • running and visualizing a gradient boosted trees model (see gbm)
  • dealing with missing values
  • how to make interactive graphics with ggplotly
  • making use of “base” graphics (see the function plot)
  • using lasso-penalized regression (glmnet)
  • fitting an autoregressive model (see the arima() function)
  • fitting a moving average model (see the arima() function)
  • using the Kolmogorov-Smirnov test (see ks.test())

Note that, as part of the tutorial, you should find one or two datasets that you can use to demonstrate the material and to base your lab questions on. I suggest setting up your lab to include roughly 10 questions. I will meet with each of you while you are working on this project to make sure the scope of the tutorial is neither too narrow nor too broad.