library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(cleanNLP)
library(glmnet)
library(Matrix)
library(xgboost)
library(stringi)
library(magrittr)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
options(sparse.colnames = TRUE)

Amazon Authorship Data: Music

I created enough data for 8 class groups, but we ended up having only 7. I thought it might be more interesting in today’s notes to use the 8th dataset to illustrate some new points. This group uses reviews of Music CDs (yes, they still exist).

amazon <- read_csv(file.path("data", sprintf("%s.csv.gz", "cds")))
token <- read_csv(file.path("data", sprintf("%s_token.csv.gz", "cds")))

In the notes, we will start by seeing a new kind of predictive model that can be applied to this data set. Then, we will look at a few types of visualizations that you might find useful for Project 2.

Local Models

All of the models that we have looked at up to this point derive from linear regression. We have seen that we can extend this model in a number of ways. Using a link function produces generalized linear regression, which allows us to do classification with logistic regression. Parameter expansion extends the linear regression model to cases where there are interactions or non-linear relationships between variables. Adding a penalty term produces estimators such as the lasso, ridge, and elastic net. All of these allow for working with a large collection of variables without overfitting to the training data.

We could continue along this path and learn even more extensions of linear regression. Linear discriminant analysis and support vector machines, for example, can be derived as geometric extensions of logistic regression. Fixed and mixed effects models allow for more control over the way that correlated observations are used in training. Generalized additive models provide a more complete and adaptive form of feature expansion for finding non-linear relationships. Bayesian models allow for more complex relationships between parameters in the model and provide a measure of uncertainty about the output.

All of these methods, though, approach the prediction problem in the same fundamental way. Weights are assigned to each feature variable (column of X) and the final prediction is determined by combining all of these effects together. I call all of these global models. We could spend more time learning other examples, but I think there are diminishing returns. It is unlikely that these alternatives would produce consistently better estimators than the ones we have already learned.

What I want to spend time looking at this week, instead, are local models. These take a completely different approach, working off of this idea: if we want to make a prediction for a new observation, just look at the classes of the training points that are “close” to the new point and pick the most common one. So, instead of learning global weights, we just use distances and compare to the training data that we already have.

The most straightforward local model, which we will look at today, is called k-nearest neighbors, or KNN. To make a prediction on a new point given a training dataset, find the k closest points in the training dataset and take the most common class among them (in the case of ties, remove the farthest points until there is a winner). You can also apply this to regression problems by averaging the responses of the k neighbors.
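
To make this concrete, here is a tiny hand-rolled sketch of the KNN idea on made-up data (the points, labels, and choice of k = 3 below are purely hypothetical, just to illustrate the distance-and-vote logic):

# four hypothetical training points, two from class "a" and two from "b"
x_toy <- matrix(c(0, 0,
                  0, 1,
                  5, 5,
                  5, 6), ncol = 2, byrow = TRUE)
y_toy <- c("a", "a", "b", "b")
new_pt <- c(1, 0)

# Euclidean distance from the new point to every training point
dists <- sqrt(rowSums((x_toy - rep(new_pt, each = nrow(x_toy)))^2))

# take the k = 3 nearest neighbors and vote on the class
votes <- table(y_toy[order(dists)[1:3]])
names(votes)[which.max(votes)]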

Let’s see how to do this with our data in R. We’ll create a matrix of frequencies for the 200 most common lemmas in the dataset:

X <- token %>%
  cnlp_utils_tf(doc_set = amazon$doc_id,
                min_df = 0.005,
                max_df = 1,
                max_features = 200,
                token_var = "lemma")
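
As a quick sanity check (optional), the term-frequency matrix should have one row for each review and at most 200 columns:

dim(X)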

In order to run the KNN algorithm, we need to create a response vector and training matrix:

y <- amazon$user_id
y_train <- y[amazon$train_id == "train"]
X_train <- X[amazon$train_id == "train",]

Then, we load the package FNN (fast nearest neighbors) and use the knn function. We start by setting k equal to 4. Unlike linear regression, there is no fitted model per se. The algorithm jumps right to the predictions, which we will save as a new vector in R.

library(FNN)
y_hat <- knn(X_train, X, y_train, k = 4)

After the algorithm finishes, we can see how predictive the model is:

amazon %>%
  mutate(pred = y_hat) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(user_id == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.819
## 2 valid         0.713

It predicts the correct class about 71% of the time on the validation set. Not too bad given that there are 25 different authors. We could tweak the parameter k to try to do a bit better.
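
One way to do that, sketched below with an arbitrary set of candidate values, is to loop over a few choices of k and compare the validation classification rates:

# try several values of k and record the validation classification rate
k_vals <- c(1, 2, 4, 8, 16)
valid_rate <- sapply(k_vals, function(k) {
  pred <- knn(X_train, X, y_train, k = k)
  mean(y[amazon$train_id == "valid"] == pred[amazon$train_id == "valid"])
})
tibble(k = k_vals, class_rate = valid_rate)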

This is a nice example of a completely different model that can help show how well your other models are working. It takes a very different approach and is best when given a relatively dense matrix (as above, lots of non-zeros); it is particularly good when there are tricky interactions that are hard to capture with global models. There is not much more that we can do with it here, though. Next time we will see a different local model that produces more interesting importance scores, which help describe the most important variables in a local model approach.

A few Project 2 Visualizations and Tables

Better coefficients

You may have noticed that some of the users in some of the data sets change their user names over time, so you’ll get better predictions using user_id. However, it is nice to make visualizations with the full names, which are a bit more interesting. I will make a look-up table here:

# every (user_id, user_name) pair that appears in the data
uname_all <- distinct(amazon, user_name, user_id) %>% arrange(desc(user_id))
# one representative user_name for each user_id
uname <- distinct(amazon, user_name, user_id) %>% filter(!duplicated(user_id))
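
If you are curious which reviewers actually changed their display names, a quick check (not needed for the model) is to look for user ids attached to more than one name:

uname_all %>%
  group_by(user_id) %>%
  filter(n() > 1)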

To start, let’s build a penalized regression model using lemmas:

X <- token %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

X_train <- X[amazon$train_id == "train", ]
y_train <- amazon$user_id[amazon$train_id == "train"]

model <- cv.glmnet(
  X_train,
  y_train,
  alpha = 0.9,
  family = "multinomial",
  nfolds = 3,
  trace.it = FALSE,
  relax = FALSE,
  lambda.min.ratio = 0.01,
  nlambda = 100
)
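
Before reading off coefficients, it can help to glance at the cross-validation curve; the standard plot method for a cv.glmnet object shows the multinomial deviance across the lambda path:

plot(model)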

You have probably noticed that the normal coefficient table is a bit hard to read. Here is an alternative layout that just shows the positive terms associated with each author:

# coefficients at one value of lambda; for a multinomial model this is a
# list with one sparse set of coefficients per author
temp <- coef(model, s = model$lambda[25])
# bind the per-author columns together and drop the intercept row
beta <- Reduce(cbind, temp)[-1,]

# reshape into a long table with one row per (author, term) pair, keeping
# only the non-zero coefficients
beta <- filter(tibble(
  user_id = rep(names(temp), each = nrow(beta)),
  term = rep(rownames(beta), ncol(beta)),
  coef = as.numeric(beta)
), coef != 0)

beta %>%
  arrange(desc(coef)) %>%
  filter(coef > 0) %>%
  left_join(uname, by = "user_id") %>%
  group_by(user_name) %>%
  summarize(these = paste(term, collapse = "; ")) %>%
  mutate(output = sprintf("%30s : %s", user_name, these)) %>%
  use_series(output) %>%
  cat(sep = "\n")
##               Andre S. Grindle : funk; groove
##  Bernard Michael O&#039;Hanlon : :; Practice; Mozart
##                   Bjorn Viberg : 5/5; 4/5
##              Blake Meahl  2018 : enjoyable; fun; i; soundtrack
##                          Bryan : *; melody; vocal; guitar; really; songwriting
##                         Deimos : \m/\m/
##            Donald E. Gilliland : ---
##                           G.D. : generally; work; colorful; relatively; color; composer
##            George O&#039;Leary : Billboard; chart; b; #; Pop
##                     Grady Harp : Grady; Harp
##                    hyperbolium : hyperbolium; Hyperbolium; com; dot
##                          IRate : 1/2
##               J Scott Morrison : Scott; Morrison
##             Matthew G. Sherwin : artwork; sing; arrangement; perform; CD; entitle; ;; number
##           MAXIMILLIAN MUHAMMAD : Genius
##                    Paul Allaer : min
##           Peter Durward Harris : sixty; hit; country
##                    Ralph Moore : celebrated; favourite; voice; bargain; recording
##              Santa Fe Listener : reading; DG
##                      scoundrel : downtempo; dub
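
If you want the other direction, the same pipeline with filter(coef < 0) (and sorting by coef in ascending order) shows the terms that push predictions away from each author.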

Looking at stars

We can also use the covariates to show the data in some new ways. For example, how about the number of stars that each reviewer gives:

amazon %>%
  select(-user_name) %>%
  left_join(uname, by = "user_id") %>%
  ggplot(aes(stars, user_name)) +
    geom_count()

Or, similarly, the average number of stars given by each reviewer:

amazon %>%
  select(-user_name) %>%
  left_join(uname, by = "user_id") %>%
  group_by(user_name) %>%
  summarize(sm_mean_ci_normal(stars)) %>%
  arrange(desc(stars_mean)) %>%
  mutate(user_name = fct_inorder(user_name)) %>%
  ggplot(aes(user_name, stars_mean)) +
    geom_pointrange(aes(ymin = stars_ci_min, ymax = stars_ci_max)) +
    coord_flip()