Authorship Detection with the Elastic Net
To start, build an elastic net model using the default term-frequency
features to predict the authorship of each text.
# Question 01
model <- dsst_enet_build(anno, docs)
Now, compute the overall error rate of this model for each of the
training and validation sets. You should see that the model does much
better on the training set than on the validation set.
# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.0305
## 2 valid 0.173
In the code block below, look at the model coefficients. You may want
to limit the number of results by setting lambda_num to something
around 25.
# Question 03
dsst_coef(model$model, lambda_num = 25)
## 23 x 4 sparse Matrix of class "dgCMatrix"
## Hawthorne Poe Twain MLN
## (Intercept) -0.06831729 0.017274970 0.05104232 .
## Hepzibah 0.68933277 . . 2
## get . . 0.39686293 2
## which . . -0.19567277 6
## -- -0.18525201 0.016444161 . 7
## : . . 0.29823291 10
## Pyncheon 0.40294814 . . 11
## , 0.01358310 . -0.03352979 11
## ( . 0.236224622 . 13
## Phoebe 0.22806863 . . 13
## do . . 0.09308331 14
## and . -0.002789513 0.06055773 14
## Clifford 0.14867996 . . 16
## upon . 0.114822468 . 16
## 's 0.04275078 -0.028054948 . 17
## child 0.17175466 . . 18
## go . . 0.04009645 19
## as 0.02786540 . . 19
## ) . 0.034198610 . 20
## out . . 0.02826349 22
## Tom . . 0.05565280 23
## cry 0.06396788 . . 24
## thus . 0.004653544 . 25
You should notice several interesting things about the model. What
are the main features that are being used here? Do you see any patterns
about where the strongest coefficients are concentrated?
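One way to investigate is to look up the part of speech of each
selected term. The following is only a sketch, not part of the assigned
solution: it assumes the coefficient row names line up with the lemma
column of anno (if the model was built from a different token variable,
swap in that column instead).
# a sketch for checking the part of speech of the selected terms;
# assumes the coefficient row names match anno$lemma (an assumption
# about the default token variable; use token if that fails)
beta <- dsst_coef(model$model, lambda_num = 25)
terms <- setdiff(rownames(beta), "(Intercept)")
anno %>%
  filter(lemma %in% terms) %>%
  count(lemma, upos, sort = TRUE)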
Build another elastic net model but remove the proper nouns from the
texts.
# Question 04
model <- anno %>%
  filter(upos != "PROPN") %>%
  dsst_enet_build(docs)
What is the error rate of the model now? Note how it changes from
above.
# Question 05
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.0544
## 2 valid 0.151
In the code block below, look at the model coefficients. You may want
to limit the number of results by setting lambda_num to something
around 30.
# Question 06
dsst_coef(model$model, lambda_num = 30)
## 30 x 4 sparse Matrix of class "dgCMatrix"
## Hawthorne Poe Twain MLN
## (Intercept) -0.073564179 0.025418982 0.0481451964 .
## get . . 0.4353087458 2
## -- -0.241223694 0.030267977 . 6
## which . . -0.2266520978 6
## : . . 0.3859533904 10
## , 0.022122931 . -0.0460348792 11
## ( . 0.277524236 . 13
## do . . 0.1177292623 14
## 's 0.109415523 -0.074776929 . 14
## and . . 0.0890280497 14
## upon . 0.178624536 . 15
## as 0.061963286 . . 17
## child 0.258828863 . . 18
## go . . 0.0659779319 19
## ) . 0.078434183 . 20
## out . . 0.0699285886 21
## cry 0.214942858 . . 24
## thus . 0.089718602 . 26
## king . . 0.0589868543 26
## length . 0.105794765 . 27
## although . 0.103875322 . 27
## old 0.036604609 -0.023741449 . 27
## come . -0.029399521 . 27
## young 0.044407384 . . 28
## then . . 0.0140891938 28
## ' -0.008326852 . . 29
## ; . . 0.0056517226 29
## warn't . . 0.0030551040 30
## period . 0.002901135 . 30
## of . . -0.0005576938 30
You should now notice that the types of words being used are very
different. What are the primary qualities of (most of) the terms now
being selected?
Now, build an elastic net model using the "xpos" column to build the
frequencies (i.e., set the token_var parameter to "xpos"). These codes
offer a more granular way of describing parts of speech.
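Before fitting, it can help to see how the granular codes relate to the
universal tags we have already used. A quick cross-tabulation, using
only the upos and xpos columns that appear elsewhere in this lab:
# peek at how the granular xpos codes map onto the coarser upos codes
anno %>%
  count(upos, xpos, sort = TRUE)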
# Question 07
model <- anno %>%
  dsst_enet_build(docs, token_var = "xpos")
Compute the error rate, comparing it to the previous error rates.
# Question 08
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.284
## 2 valid 0.298
We can do a bit better using n-grams of xpos tags. Create a model
using 3-grams of xpos tags.
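If you are curious what these features look like, you can peek at the
output of dsst_ngram before fitting. A hedged sketch, assuming (as the
pipeline below suggests) that dsst_ngram returns an annotation table
shaped like its input:
# a peek at the constructed n-gram features; the exact output columns
# of dsst_ngram are an assumption here
anno %>%
  dsst_ngram(token_var = "xpos", n = 3, n_min = 1) %>%
  head()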
# Question 09
model <- anno %>%
  dsst_ngram(token_var = "xpos", n = 3, n_min = 1) %>%
  dsst_enet_build(docs)
Finally, compute the error rate of this model:
# Question 10
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.113
## 2 valid 0.188
How does it compare to the word-based model above? It will likely be
very close, though perhaps still slightly worse.
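To see which tag patterns the model relies on, the coefficients can be
inspected exactly as before; the lambda_num value here is just a
reasonable starting point.
# inspect which xpos n-gram patterns carry the strongest weights
dsst_coef(model$model, lambda_num = 25)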
Authorship Detection with k-Nearest Neighbors
Now, let’s use the k-nearest neighbors method to build a model. Start
by creating a knn model with all of the default features and k = 1 (the
default).
# Question 11
model <- anno %>%
  dsst_knn_build(docs)
Now, print out the error rate of the model and compare it to the
elastic net model built with the same default features (Question 01).
# Question 12
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0
## 2 valid 0.723
You should have noticed that knn performs significantly worse than the
elastic net model here. Note that the training error of exactly zero is
expected: with k = 1, every training point is its own nearest neighbor.
Take a few moments to think about how these two models differ and why
knn is not particularly well suited to this prediction task.
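One quick experiment is to increase k and see whether the validation
error improves. A minimal sketch, assuming dsst_knn_build accepts k as
an argument (the text above implies k = 1 is its default):
# assumed signature: k as a tuning parameter of dsst_knn_build
model <- anno %>%
  dsst_knn_build(docs, k = 5)

model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))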
Data Visualization
For the last few questions, we'll do some data visualization directly
using the information in the anno and docs tables. The general pattern
is to first manipulate and summarize the annotations, then join to the
documents table to get the labels, and finally summarize again and plot
the results.
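In code, that pattern looks roughly like the skeleton below; the
grouping and summary functions are placeholders to adapt for each
question.
# general template: (1) summarize the annotations, (2) join to get the
# labels, (3) summarize again and plot
anno %>%
  group_by(doc_id) %>%
  summarize(measure = n()) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(avg = mean(measure)) %>%
  ggplot(aes(label, avg)) +
    geom_col()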
For the first question, create a bar plot (remember, this needs
geom_col) that shows the average length of sentences in the data,
broken down by the author and whether the data comes from the training
or validation set. Use the fill color to distinguish the train/valid
split (hint: you'll need to set position = "dodge" in the geometry).
# Question 13
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n = n()) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(avg = mean(n)) %>%
  ggplot(aes(label, avg)) +
    geom_col(aes(fill = train_id), position = "dodge")
Now, for each combination of the train/valid set and author, compute
the average number of verbs in each sentence and the average number of
adjectives in each sentence. Plot these two measurements using a scatter
plot, with color indicating the author and shape indicating the
train/valid split.
# Question 14
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n_verb = sum(upos == "VERB"), n_adj = sum(upos == "ADJ")) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(
    avg_verb = mean(n_verb),
    avg_adj = mean(n_adj)
  ) %>%
  ggplot(aes(avg_verb, avg_adj)) +
    geom_point(aes(color = label, shape = train_id), size = 3)
Now, we will create a plot that shows the average number of verbs in a
sentence for each combination of the author and train/valid set. In
this plot, put the author on the x-axis and the average on the y-axis;
use color to distinguish the train/valid split. In addition, compute
what we call the standard error of the average number of verbs
by dividing the standard deviation (the function sd) of the number of
verbs by the square root (sqrt) of the number of sentences. Finally,
plot the data using the geometry geom_pointrange; this requires setting
the x and y aesthetics as well as the ymin and ymax aesthetics. Set the
values of these last two to the average number of verbs minus and plus
two times the standard error.
This sounds like a lot of work, but the code is actually not very long.
What it produces are confidence intervals: a measure of where we would
expect the average values to fall if we re-sampled from the same data
in a similar way.
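As a standalone illustration of the computation (plain base R, with
made-up numbers):
# standard error of a mean: sd(x) / sqrt(number of observations)
x <- c(4, 7, 5, 6, 8, 5)
se <- sd(x) / sqrt(length(x))
c(mean(x) - 2 * se, mean(x) + 2 * se)  # an approximate 95% interval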
# Question 15
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n_verb = sum(upos == "VERB")) %>%
  left_join(docs, by = "doc_id") %>%
  group_by(train_id, label) %>%
  summarize(avg = mean(n_verb), s = sd(n_verb) / sqrt(n())) %>%
  ggplot(aes(label, avg)) +
    geom_pointrange(aes(
      ymin = avg - 2 * s, ymax = avg + 2 * s, color = train_id
    ))
This is a tricky question if you’ve never worked with confidence
intervals before. Just take a look at the solutions and try to
understand how to interpret the results. We will talk more about this
model and others like it next week.
A final note: in all of these visualizations, I had you split the
training and validation datasets and look at them in relation to one
another. This is because of the special nature of the train/valid split
(they are different novels) in this example. In the project, I'd
recommend replicating plots like these using only the validation
data.
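For example, restricting the plot from Question 15 to the validation
novels only requires filtering docs before the join; a minimal sketch:
# same pipeline as Question 15, restricted to the validation set
anno %>%
  group_by(doc_id, sid) %>%
  summarize(n_verb = sum(upos == "VERB")) %>%
  inner_join(filter(docs, train_id == "valid"), by = "doc_id") %>%
  group_by(label) %>%
  summarize(avg = mean(n_verb), s = sd(n_verb) / sqrt(n())) %>%
  ggplot(aes(label, avg)) +
    geom_pointrange(aes(ymin = avg - 2 * s, ymax = avg + 2 * s))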