Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.
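Although the setup chunk itself is hidden by echo=FALSE, it loads something along these lines (a sketch; the exact contents are an assumption based on the functions used below):
# Assumed contents of the hidden setup chunk; dsst is the course package that
# provides dsst_enet_build(), dsst_coef(), and dsst_ngram().
library(tidyverse)
library(dsst)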
To start, let’s read in some text data to study for today. The dataset for this notebook consists of a set of movie reviews.
docs <- read_csv("../data/imdb_polarity_mod.csv.gz")
anno <- read_csv("../data/imdb_polarity_mod_token.csv.gz")
The label we are trying to predict is whether a review is a 4 (out of 10) or a 7 (out of 10).
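Before modeling, it can help to check how the label is distributed across the training and validation splits. A minimal sketch, assuming docs contains the category and train_id columns used throughout this notebook:
# Count reviews by split and label; roughly balanced classes mean a random
# guess would have an error rate near 0.5.
docs %>%
  count(train_id, category)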
Start by building an elastic net model for this data using the default values. Note that you will have to set the parameter label_var equal to “category” because the label has a different name.
# Question 01
model <- anno %>%
  dsst_enet_build(docs, label_var = "category")
Now compute the error rate of the model.
# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.155
## 2 valid 0.263
Then, look at the model coefficients. Select a small lambda number so that only a small set of parameters is returned.
# Question 03
dsst_coef(model$model, lambda_num = 10)
## 2 x 3 sparse Matrix of class "dgCMatrix"
## S04 S07 MLN
## (Intercept) -0.01894224 0.01894224 .
## bad 0.04593585 -0.04593585 2
And finally, use the command plot applied to the model$model object to obtain the cross-validation curve.
# Question 04
plot(model$model)
Now, before proceeding, take a moment to look at all of the results and understand what the model is telling you about the data and how well we can predict the label.
Now, fit an elastic net model using only those lemmas labeled as a VERB.
# Question 05
model <- anno %>%
  filter(upos %in% c("VERB")) %>%
  dsst_enet_build(docs, label_var = "category")
Now compute the error rate of the model.
# Question 06
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.338
## 2 valid 0.398
Then, the model coefficients. Again, select a lambda number to get a reasonable number of values.
# Question 07
dsst_coef(model$model, lambda_num = 10)
## 8 x 3 sparse Matrix of class "dgCMatrix"
## S04 S07 MLN
## (Intercept) -0.006087754 0.006087754 .
## fail 0.082129984 -0.082129984 2
## waste 0.084807878 -0.084807878 3
## enjoy -0.042982352 0.042982352 3
## suppose 0.032275276 -0.032275276 6
## try 0.007038441 -0.007038441 8
## marry -0.009725810 0.009725810 10
## care 0.002053509 -0.002053509 10
And finally, plot the cross-validation curve.
# Question 08
plot(model$model)
Take a moment to compare the results to the previous model. How well does this do compared to using all of the parts of speech? Can you make any sense of the coefficients in a new way that tells you something about the data? How about the cross-validation curve: does it tell you anything about the model?
Now, fit an elastic net model using bigrams and unigrams of all the lemmas.
# Question 09
model <- anno %>%
  dsst_ngram(n = 2, n_min = 1) %>%
  dsst_enet_build(docs, label_var = "category")
Now compute the error rate of the model.
# Question 10
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.112
## 2 valid 0.25
Then, look at the model coefficients, selecting a reasonable lambda number.
# Question 11
dsst_coef(model$model, lambda_num = 20)
## 13 x 3 sparse Matrix of class "dgCMatrix"
## S04 S07 MLN
## (Intercept) -0.040567580 0.040567580 .
## bad 0.081521936 -0.081521936 2
## waste 0.118656271 -0.118656271 11
## ? 0.014887124 -0.014887124 14
## dull 0.061141609 -0.061141609 15
## great -0.015422836 0.015422836 17
## perfect -0.027497384 0.027497384 18
## excellent -0.019273815 0.019273815 18
## nothing 0.011130787 -0.011130787 18
## enjoyable -0.021937252 0.021937252 19
## still -0.009224031 0.009224031 19
## well worth -0.008911788 0.008911788 20
## no 0.001543492 -0.001543492 20
And then plot the cross-validation curve.
# Question 12
plot(model$model)
How does this model do compared to the original model? How many of the top variables are bigrams compared with unigrams (see the sketch below for one way to count them)? How much new information does the bigram model tell you about the data?
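One way to answer the bigram question is to count the selected terms that contain a space, since bigram lemmas are joined with a space (as in “well worth” above). A minimal sketch:
# Terms selected at lambda number 20; bigrams contain a space between lemmas.
beta <- dsst_coef(model$model, lambda_num = 20)
sum(grepl(" ", rownames(beta)))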
Finally, fit an elastic net model using just the upos codes.
# Question 13
model <- anno %>%
  dsst_enet_build(docs, token_var = "upos", label_var = "category")
Now compute the error rate of the model.
# Question 14
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.415
## 2 valid 0.419
Then, the model coefficients.
# Question 15
dsst_coef(model$model)
## 14 x 3 sparse Matrix of class "dgCMatrix"
## S04 S07 MLN
## (Intercept) 0.0224700966 -0.0224700966 .
## PROPN -0.0067638511 0.0067638511 2
## INTJ 0.0499835215 -0.0499835215 4
## PART 0.0193797200 -0.0193797200 17
## CCONJ -0.0134437954 0.0134437954 32
## SPACE 0.0173162617 -0.0173162617 39
## X 0.0125119143 -0.0125119143 39
## ADV 0.0053558263 -0.0053558263 44
## NUM -0.0090062832 0.0090062832 45
## AUX 0.0059677497 -0.0059677497 50
## DET -0.0046865936 0.0046865936 51
## VERB 0.0031681309 -0.0031681309 66
## PRON -0.0035028569 0.0035028569 70
## ADJ -0.0006364654 0.0006364654 71
And finally, plot the cross-validation curve.
# Question 16
plot(model$model)
How does this model do compared to the original one? Probably not great, but notice that it does do better than random guessing (see the baseline sketch below). What can be learned from the model? Can you understand why any of the coefficients have the signs that they do? Does this help you understand any results above?
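As a point of comparison, here is a sketch of the majority-class baseline error rate on the validation set, assuming the same category and train_id columns used above:
# Error rate from always guessing the most common label in the validation set.
docs %>%
  filter(train_id == "valid") %>%
  summarize(baseline_erate = 1 - max(table(category)) / n())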
If you have some extra time, try to find the most predictive model that you can create in the block below.
# Question 17
# No fixed solution; did you try trigrams and/or part-of-speech labels? Maybe
# also increasing the number of terms? One possible starting point is sketched
# below.
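For example, one possible starting point (not a definitive answer) extends the bigram model to trigrams, assuming dsst_ngram accepts n = 3 the same way it accepts n = 2:
# One possible attempt: unigrams, bigrams, and trigrams of all the lemmas.
model <- anno %>%
  dsst_ngram(n = 3, n_min = 1) %>%
  dsst_enet_build(docs, label_var = "category")

# Check the error rate as before.
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))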