Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to click the broom icon in the upper right-hand corner of the window. This will clear any old datasets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the chunk options message=FALSE and echo=FALSE to avoid cluttering your solutions with the code and its messages, which is why the chunk itself is not displayed here.
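For reference, here is a minimal sketch of what that hidden setup chunk typically contains. The exact path to the course helper file is an assumption and may differ in your copy of the materials.

# Hypothetical setup chunk; adjust the helper path to match your materials
library(tidyverse)
source("../funs/funs.R")   # assumed location of the dsst_* helper functions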

Reading the Data

To start, let’s read in the text data that we will study today: a set of movie reviews. The first file contains the documents themselves and the second contains the token-level annotations.

docs <- read_csv("../data/imdb_polarity_mod.csv.gz")
anno <- read_csv("../data/imdb_polarity_mod_token.csv.gz")

The label we are trying to predict is whether a review is a 4 (out of 10) or a 7 (out of 10).
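Before modeling, it helps to verify the class balance. A quick count, using the category column that the models below rely on:

docs %>%
  count(category)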

Questions

Model with defaults

Start by building an elastic net model for this data using the default values. Note that you will have to set the parameter label_var equal to “category” because the label column does not have the default name.

# Question 01
model <- anno %>%
  dsst_enet_build(docs, label_var = "category")

Now compute the error rate of the model.

# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.155
## 2 valid    0.263

Then, look at the model coefficients. Select a small lambda number so that only a small set of parameters is shown.

# Question 03
dsst_coef(model$model, lambda_num = 10)
## 2 x 3 sparse Matrix of class "dgCMatrix"
##                     S04         S07 MLN
## (Intercept) -0.01894224  0.01894224   .
## bad          0.04593585 -0.04593585   2

And finally, apply the plot command to the model$model object to obtain the cross-validation curve.

# Question 04
plot(model$model)

Now, before proceeding, take a moment to look at all of the results and understand what the model tells you about the data and how well we can predict the label.
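If you want a closer look at the errors, one option is to cross-tabulate the true and predicted labels on the validation set; this uses only the columns of model$docs already seen above:

# Confusion counts on the validation set
model$docs %>%
  filter(train_id == "valid") %>%
  count(category, pred_label)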

Model using verbs

Now, fit an elastic net model using only those lemmas labeled as a VERB.

# Question 05
model <- anno %>%
  filter(upos %in% c("VERB")) %>%
  dsst_enet_build(docs, label_var = "category")

Now compute the error rate of the model.

# Question 06
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.338
## 2 valid    0.398

Then, look at the model coefficients. Again, select a lambda number that gives a reasonable number of values.

# Question 07
dsst_coef(model$model, lambda_num = 10)
## 8 x 3 sparse Matrix of class "dgCMatrix"
##                      S04          S07 MLN
## (Intercept) -0.006087754  0.006087754   .
## fail         0.082129984 -0.082129984   2
## waste        0.084807878 -0.084807878   3
## enjoy       -0.042982352  0.042982352   3
## suppose      0.032275276 -0.032275276   6
## try          0.007038441 -0.007038441   8
## marry       -0.009725810  0.009725810  10
## care         0.002053509 -0.002053509  10

And finally, plot the cross-validation curve.

# Question 08
plot(model$model)

Take a moment to compare these results to the previous model. How well does this model do compared to the one that used all parts of speech? Can you make sense of the coefficients in a way that tells you something new about the data? And does the cross-validation curve tell you anything about the model?
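One way to dig into a coefficient is to read a few reviews that use a top verb such as “waste”. This sketch assumes the tables share a doc_id key and that the reviews live in a text column of docs, the usual convention for these datasets:

# Pull three reviews whose tokens include the lemma "waste"
ids <- anno %>%
  filter(lemma == "waste") %>%
  pull(doc_id)

docs %>%
  filter(doc_id %in% ids) %>%
  slice_head(n = 3) %>%
  pull(text)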

Model using ngrams

Now, fit an elastic net model using bigrams and unigrams of all the lemmas.

# Question 09
model <- anno %>%
  dsst_ngram(n = 2, n_min = 1) %>%
  dsst_enet_build(docs, label_var = "category")

Now compute the error rate of the model.

# Question 10
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.112
## 2 valid    0.25

Then, look at the model coefficients, selecting a reasonable lambda number.

# Question 11
dsst_coef(model$model, lambda_num = 20)
## 13 x 3 sparse Matrix of class "dgCMatrix"
##                      S04          S07 MLN
## (Intercept) -0.040567580  0.040567580   .
## bad          0.081521936 -0.081521936   2
## waste        0.118656271 -0.118656271  11
## ?            0.014887124 -0.014887124  14
## dull         0.061141609 -0.061141609  15
## great       -0.015422836  0.015422836  17
## perfect     -0.027497384  0.027497384  18
## excellent   -0.019273815  0.019273815  18
## nothing      0.011130787 -0.011130787  18
## enjoyable   -0.021937252  0.021937252  19
## still       -0.009224031  0.009224031  19
## well worth  -0.008911788  0.008911788  20
## no           0.001543492 -0.001543492  20

And then plot the cross-validation curve.

# Question 12
plot(model$model)

How does this model do compared to the original model? How many of the top variables are bigrams rather than unigrams? How much new information does the bigram model give you about the data?
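As a shortcut for the bigram question, you can count how many of the selected terms contain a space, since bigrams are stored as two space-separated lemmas in the table above:

# Count bigrams among the selected terms
selected <- rownames(dsst_coef(model$model, lambda_num = 20))
sum(grepl(" ", selected))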

Model using POS values

Finally, fit an elastic net model using just the upos codes.

# Question 13
model <- anno %>%
  dsst_enet_build(docs, token_var = "upos", label_var = "category")

Now compute the error rate of the model.

# Question 14
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(category != pred_label))
## # A tibble: 2 × 2
##   train_id erate
##   <chr>    <dbl>
## 1 train    0.415
## 2 valid    0.419

Then, the model coefficients.

# Question 15
dsst_coef(model$model)
## 14 x 3 sparse Matrix of class "dgCMatrix"
##                       S04           S07 MLN
## (Intercept)  0.0224700966 -0.0224700966   .
## PROPN       -0.0067638511  0.0067638511   2
## INTJ         0.0499835215 -0.0499835215   4
## PART         0.0193797200 -0.0193797200  17
## CCONJ       -0.0134437954  0.0134437954  32
## SPACE        0.0173162617 -0.0173162617  39
## X            0.0125119143 -0.0125119143  39
## ADV          0.0053558263 -0.0053558263  44
## NUM         -0.0090062832  0.0090062832  45
## AUX          0.0059677497 -0.0059677497  50
## DET         -0.0046865936  0.0046865936  51
## VERB         0.0031681309 -0.0031681309  66
## PRON        -0.0035028569  0.0035028569  70
## ADJ         -0.0006364654  0.0006364654  71

And finally, plot the cross-validation curve.

# Question 16
plot(model$model)

How does this model do compared to the original one? Probably not great, but notice that it still does better than random guessing. What can be learned from the model? Can you see why some of the coefficients have the signs that they do? Does this help you understand any of the results above?
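As one example, you can probe the INTJ coefficient by comparing how often interjections occur in each class. The doc_id join key is an assumption based on the usual structure of these tables:

# Proportion of tokens tagged INTJ within each category
anno %>%
  left_join(select(docs, doc_id, category), by = "doc_id") %>%
  group_by(category) %>%
  summarize(intj_rate = mean(upos == "INTJ"))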

Best model?

If you have some extra time, try to build the most predictive model that you can in the block below.

# Question 17
# No fixed solution. Did you try trigrams and/or part-of-speech labels? Maybe
# also try increasing the number of terms?
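A sketch of one starting point, reusing only functions already seen above; from here you could also filter by part of speech as in Question 05:

# One idea: include unigrams, bigrams, and trigrams
model <- anno %>%
  dsst_ngram(n = 3, n_min = 1) %>%
  dsst_enet_build(docs, label_var = "category")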