Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also need to click the broom icon in the upper right-hand corner of the window. This clears any old data sets and gives us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the chunk options message=FALSE and echo=FALSE so that the setup code and its startup messages do not clutter your solutions.
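
Because echo=FALSE hides the setup chunk in the rendered output, it does not appear here. The sketch below is what such a chunk typically contains; the exact package names are assumptions based on the functions used later in this notebook (read_csv, the pipe, and ggplot suggest the tidyverse, and the dsst_* helpers suggest a course package named dsst).

library(tidyverse)  # read_csv, dplyr verbs, the pipe, and ggplot2
library(dsst)       # assumed name of the package providing dsst_enet_build and related helpers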

Reading the Data

Today we are going to look at a dataset of short texts written by three American authors:

docs <- read_csv("../data/stylo_us.csv")
anno <- read_csv("../data/stylo_us_token.csv.gz")

The prediction task is to determine the identity of the author based on the text.
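
Before modeling, it can help to check how the texts are distributed across authors and across the train/valid split. This short sketch uses the label and train_id columns of docs, which also appear in the questions below:

docs %>%
  group_by(label, train_id) %>%
  summarize(n = n())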

Questions

Authorship Detection with the Elastic Net

To start, build an elastic net model using the default term-frequency features to predict the authorship of each text.

# Question 01
model <- dsst_enet_build(anno, docs)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

Now, compute the overall error rate of this model for each of the training and validation sets. You should see that the model performs much better on the training set than on the validation set.

# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id  erate
##   <chr>     <dbl>
## 1 train    0.0305
## 2 valid    0.173

In the code block below, look at the model coefficients. You may want to limit the number of results by setting lambda_num to something around 25.

# Question 03
dsst_coef(model$model, lambda_num = 25)
## 23 x 4 sparse Matrix of class "dgCMatrix"
##               Hawthorne          Poe       Twain MLN
## (Intercept) -0.06831729  0.017274970  0.05104232   .
## Hepzibah     0.68933277  .            .            2
## get          .           .            0.39686293   2
## which        .           .           -0.19567277   6
## --          -0.18525201  0.016444161  .            7
## :            .           .            0.29823291  10
## Pyncheon     0.40294814  .            .           11
## ,            0.01358310  .           -0.03352979  11
## (            .           0.236224622  .           13
## Phoebe       0.22806863  .            .           13
## do           .           .            0.09308331  14
## and          .          -0.002789513  0.06055773  14
## Clifford     0.14867996  .            .           16
## upon         .           0.114822468  .           16
## 's           0.04275078 -0.028054948  .           17
## child        0.17175466  .            .           18
## go           .           .            0.04009645  19
## as           0.02786540  .            .           19
## )            .           0.034198610  .           20
## out          .           .            0.02826349  22
## Tom          .           .            0.05565280  23
## cry          0.06396788  .            .           24
## thus         .           0.004653544  .           25

You should notice several interesting things about the model. What are the main features being used here? Do you see any patterns in where the strongest coefficients are concentrated?

Build another elastic net model but remove the proper nouns from the texts.

# Question 04
model <- anno %>%
  filter(upos != "PROPN") %>%
  dsst_enet_build(docs)

What is the error rate of the model now? Note how it changes from above.

# Question 05
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id  erate
##   <chr>     <dbl>
## 1 train    0.0544
## 2 valid    0.151

In the code block below, look at the model coefficients. You may want to limit the number of results by setting lambda_num to something around 30.

# Question 06
dsst_coef(model$model, lambda_num = 30)
## 30 x 4 sparse Matrix of class "dgCMatrix"
##                Hawthorne          Poe         Twain MLN
## (Intercept) -0.073564179  0.025418982  0.0481451964   .
## get          .            .            0.4353087458   2
## --          -0.241223694  0.030267977  .              6
## which        .            .           -0.2266520978   6
## :            .            .            0.3859533904  10
## ,            0.022122931  .           -0.0460348792  11
## (            .            0.277524236  .             13
## do           .            .            0.1177292623  14
## 's           0.109415523 -0.074776929  .             14
## and          .            .            0.0890280497  14
## upon         .            0.178624536  .             15
## as           0.061963286  .            .             17
## child        0.258828863  .            .             18
## go           .            .            0.0659779319  19
## )            .            0.078434183  .             20
## out          .            .            0.0699285886  21
## cry          0.214942858  .            .             24
## thus         .            0.089718602  .             26
## king         .            .            0.0589868543  26
## length       .            0.105794765  .             27
## although     .            0.103875322  .             27
## old          0.036604609 -0.023741449  .             27
## come         .           -0.029399521  .             27
## young        0.044407384  .            .             28
## then         .            .            0.0140891938  28
## '           -0.008326852  .            .             29
## ;            .            .            0.0056517226  29
## warn't       .            .            0.0030551040  30
## period       .            0.002901135  .             30
## of           .            .           -0.0005576938  30

You should now notice that the types of words being used are very different. What are the primary qualities of (most of) the terms now being selected?

Now, build an elastic net model using the “xpos” column to create the term frequencies (i.e., set the token_var parameter to “xpos”). These codes offer a more granular description of the parts of speech.
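
If you would like to see what these codes look like before fitting the model, one option is to cross-tabulate them against the coarser upos categories. This is just an exploratory sketch; it assumes only that anno contains the upos and xpos columns used elsewhere in this notebook.

anno %>%
  group_by(upos, xpos) %>%
  summarize(n = n()) %>%
  arrange(desc(n))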

# Question 07
model <- anno %>%
  dsst_enet_build(docs, token_var = "xpos")

Compute the error rate, comparing it to the previous error rates.

# Question 08
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.284
## 2 valid    0.298

We can do a bit better using n-grams of xpos tags. Create a model using 3-grams of xpos tags.

# Question 09
model <- anno %>%
  dsst_ngram(token_var = "xpos", n = 3, n_min = 1) %>%
  dsst_enet_build(docs)

Finally, compute the error rate of this model:

# Question 10
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.113
## 2 valid    0.188

How does it compare to the word-based model above? It will likely be very close, though perhaps still slightly worse.

Authorship Detection with k-Nearest Neighbors

Now, let’s use the k-nearest neighbors method to build a model. Start by creating a knn model with all of the default features and k = 1 (the default).

# Question 11
model <- anno %>%
  dsst_knn_build(docs)

Now, print out the error rate of the model and compare it to the same model fit with the elastic net.

# Question 12
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0    
## 2 valid    0.723

You should have noticed that knn is significantly worse than the elastic net model. Take a few moments to think about the differences between these models and why knn is not particularly well suited to the prediction task here.

Data Visualization

For the last few questions, we’ll do some data visualization directly using the information in the anno and docs tables. The general pattern of how to do these is to first manipulate and summarize the annotations, then join to the documents table to get the labels, and finally summarize again and plot the results.
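
A minimal skeleton of that pattern is sketched below; the per-sentence measurement and the final geometry are placeholders that change from question to question (this version mirrors the first plot that follows).

anno %>%
  group_by(doc_id, sid) %>%              # 1. summarize the annotations per sentence
  summarize(measure = n()) %>%
  left_join(docs, by = "doc_id") %>%     # 2. join to get label and train_id
  group_by(train_id, label) %>%          # 3. summarize again by author and split
  summarize(avg = mean(measure)) %>%
  ggplot(aes(label, avg)) +              # then plot the result
    geom_col(aes(fill = train_id), position = "dodge")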

For the first question, create a bar plot (remember, this needs geom_col) that shows the average length of sentences in the data broken down by the author and whether the data comes from the training or validation set. Use the fill color to distinguish the train/valid split (hint: you’ll need to set position = “dodge” in the geometry).

# Question 13
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n = n()) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(avg = mean(n)) %>%
  ggplot(aes(label, avg)) +
    geom_col(aes(fill = train_id), position = "dodge")

Now, for each combination of the train/valid set and author, compute the average number of verbs in each sentence and the average number of adjectives in each sentence. Plot these two measurements using a scatter plot, with color indicating the author and shape indicating the train/valid split.

# Question 14
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n_verb = sum(upos == "VERB"), n_adj = sum(upos == "ADJ")) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(
    avg_verb = mean(n_verb),
    avg_adj = mean(n_adj)
  ) %>%
  ggplot(aes(avg_verb, avg_adj)) +
    geom_point(aes(color = label, shape = train_id), size = 3)

Now, we will create a plot that shows the average number of verbs in a sentence for each combination of the author and train/valid set. In this plot, put the author on the x-axis and the average on the y-axis; use color to distinguish the train/valid split. In addition, compute what we call the standard error of the average number of verbs by dividing the standard deviation of the number of verbs (the function sd) by the square root (sqrt) of the number of sentences. Finally, plot the data using the geometry geom_pointrange; this requires setting the x and y aesthetics as well as the ymin and ymax aesthetics. Set the values of these last two to be the average number of verbs minus and plus two times the standard error.

This sounds like a lot of work, but the code is actually not very long. It produces confidence intervals: a measurement of where we would expect the average values to fall if we re-sampled in a similar way from the same data.
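
To see the arithmetic on its own, here is a small standalone example with a made-up vector of verb counts (the numbers are purely illustrative):

x <- c(2, 4, 3, 5, 4, 1)               # hypothetical verb counts for six sentences
se <- sd(x) / sqrt(length(x))          # standard error of the mean
c(lower = mean(x) - 2 * se, upper = mean(x) + 2 * se)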

# Question 15
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n_verb = sum(upos == "VERB")) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(avg = mean(n_verb), s = sd(n_verb) / sqrt(n())) %>%
  ggplot(aes(label, avg)) +
    geom_pointrange(aes(
      ymin = avg - 2 * s,  ymax = avg + 2 * s, color = train_id
    ))

This is a tricky question if you’ve never worked with confidence intervals before. Just take a look at the solutions and try to understand how to interpret the results. We will talk more about this model and others like it next week.

A final note: in all of these visualizations, I had you split the training and validation datasets and look at them in relation to one another. This is because of the special nature of the train/valid split (the two sets come from different novels) in this example. In the project, I'd recommend replicating plots like these using only the validation data.