Load the Data

For today’s notes we are going to look again at the Amazon product classification task. We will read in the docs and anno tables:

docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")

As a reminder from last time, the goal is to predict the product category of a review based on the text of the user's review. Today, we will focus on the ways that we can influence the construction of the elastic net model.

Selecting the Features

By far the most important way that we can modify the model is by selecting the features that will be used in the model. By default, the model uses the lemma column to build features. It creates features from the 10,000 most frequent terms that are in at least 0.1% of the documents. Here is the function to build the elastic net model with all of those defaults made explicit:

model <- dsst_enet_build(anno, docs,
                         min_df = 0.001,
                         max_df = 1.0,
                         max_features = 10000)

We can change the model by modifying these values. For example, setting max_df to 0.5 will exclude lemmas that appear in more than half of the reviews. This can be useful if the model is including too many stylistic words, such as pronouns, that we do not want as part of our analysis. There is no clear rule about how to set these values. Experimentation is key and you will practice this in the class notes for today.
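As a sketch of what that looks like, here is the same call as above with max_df lowered to 0.5 (the other parameters keep their default values):

```r
# exclude lemmas that appear in more than half of the documents;
# min_df and max_features stay at their defaults
model <- dsst_enet_build(anno, docs,
                         min_df = 0.001,
                         max_df = 0.5,
                         max_features = 10000)
```

Very common function words tend to be the first features removed by a lower max_df, which is why this is a reasonable first lever when stylistic terms dominate the coefficients.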

Filter Part of Speech

Another way to change the predictive model that is being created is through filtering the annotations table before passing it to the dsst_enet_build function. This requires that we write a little bit of our own R code. One of the most common ways we will want to filter the data is through the column upos, the Universal POS tags.

We can use the upos tags to select only certain parts of speech. This can be useful in a few ways, such as removing structural tokens (i.e., punctuation and determiners), or focusing on stylistic features such as adjectives in order to avoid content-specific terms. Here is a typical usage, where we include only the nouns, verbs, adjectives, adverbs, and pronouns:

model <- anno %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  dsst_enet_build(docs)

Often you will find that you want to build multiple models with different parts of speech to understand a dataset.
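For instance, a second, more stylistic model might restrict the features to adjectives and adverbs only (a sketch using the same pattern as above; the variable name model_style is just illustrative):

```r
# a stylistic model built from adjectives and adverbs only,
# which avoids most content-specific nouns and verbs
model_style <- anno %>%
  filter(upos %in% c("ADJ", "ADV")) %>%
  dsst_enet_build(docs)
```

Comparing the coefficients of this model with those of the five-part-of-speech model is one way to separate what is said in the reviews from how it is said.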

N-grams

Another way to modify the model is to transform the lemmas found in the anno table. A popular way to do this is through the use of features called N-grams. These look at combinations of adjacent tokens rather than just individual tokens. The N refers to the number of tokens that are put together.

The helper function dsst_ngram will produce N-gram features of various lengths. We can specify the maximum (n) and minimum (n_min) length of the N-grams. The output of this can then be passed directly to dsst_enet_build. For example:

model <- anno %>%
  dsst_ngram(n = 2, n_min = 2) %>%
  dsst_enet_build(docs)

You can see the impact of the N-grams in the model coefficients:

dsst_coef(model$model, lambda_num = 30)
## 13 x 4 sparse Matrix of class "dgCMatrix"
##                    book        film         food MLN
## (Intercept) -0.15538849 0.006374181  0.149014308   .
## this book    0.79035526 .            .             2
## this movie   .          0.505010237  .            13
## the book     0.17900509 .            .            18
## this film    .          0.207687586  .            20
## book .       0.19341191 .            .            20
## film be      .          0.114304618  .            22
## the film     .          0.078635352  .            23
## movie .      .          0.172070238  .            24
## book be      0.08036933 .            .            25
## to read      0.09108604 .            .            28
## film .       .          0.041923087  .            28
## of the       .          .           -0.009114281  29

An extension of this approach is called the skip-gram. Skip-grams work similarly but allow combinations of N words to skip up to K terms. To implement these, we use the function dsst_skip_gram:

model <- anno %>%
  dsst_skip_gram(n = 2, n_min = 2, k = 1)  %>%
  dsst_enet_build(docs)

It can be fun to experiment with different settings for these values, but keep in mind that setting the parameters too high will result in a very large dataset that may take a long time to finish running.

More Parts of Speech

We can also use the upos values to change the annotations, rather than just filtering them. First of all, you will have noticed that the annotations create a lemmatised version of pronouns that maps every pronoun to the string “-PRON-”. You may want to put these back into the lemmas, which we can do with the following code:

model <- anno %>%
  mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
  mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
  dsst_enet_build(docs)

We can also add the part of speech tag to the end of the lemma. This rarely changes the model much but helps interpret the coefficients:

model <- anno %>%
  mutate(lemma = paste0(lemma, "_", upos)) %>%
  dsst_enet_build(docs)

You can see the effect of this change here:

dsst_coef(model$model, lambda_num = 50)
## 16 x 4 sparse Matrix of class "dgCMatrix"
##                     book       film         food MLN
## (Intercept)   -0.2160591 0.01776748  0.198291615   .
## book_NOUN      0.6925786 .           .             2
## movie_NOUN     .         0.56825903  .            10
## film_NOUN      .         0.24714900  .            18
## flavor_NOUN    .         .           0.589105078  19
## read_VERB      0.4231783 .           .            20
## taste_VERB     .         .           0.599092119  22
## taste_NOUN     .         .           0.596607517  22
## product_NOUN   .         .           0.299008650  32
## watch_VERB     .         0.16460704  .            37
## dvd_NOUN       .         0.20163264  .            39
## eat_VERB       .         .           0.133483895  41
## story_NOUN     .         .          -0.067088477  43
## read_NOUN      0.1294661 .           .            46
## the_DET        .         .          -0.001927111  48
## delicious_ADJ  .         .           0.011247454  50

Some more advice about how to use these techniques is included below, but as always some trial and error is needed.

Alternative features

As a final tweak today, we can also use a completely different variable in the anno table to build the features by setting the token_var option. For example, we could use the upos tags themselves:

model <- anno %>%
  dsst_enet_build(docs, token_var = "upos")

Which results in a very different type of model:

dsst_coef(model$model)
## 15 x 4 sparse Matrix of class "dgCMatrix"
##                     book         film         food MLN
## (Intercept) -0.248980100 -0.222040273  0.471020373   .
## DET          .            0.010528716 -0.037635805   2
## ADP          0.020288105  .           -0.006273853  11
## PROPN        .            0.030423746 -0.024555573  14
## PART         0.037567756  .            .            31
## VERB         0.008172884  .            .            40
## INTJ        -0.150151789  0.023595291  .            55
## SYM          .           -0.007301116  0.119941005  57
## X            .           -0.169178788  0.102577676  61
## NUM         -0.024592991  .            0.097530599  62
## PRON         .           -0.012662239  0.023958537  63
## SPACE        .            0.005668980  .            74
## AUX         -0.008352086  .            .            87
## ADV         -0.003825879  .            .            90
## ADJ          .            .            0.005461118  93

We will see models like this more as we turn to authorship prediction tasks in the next few projects.

Keywords in Context

As a final tool for today, we can also go back and identify the use of particular terms within the documents using a method called keywords in context, or KWiC. The function works by passing it a term to search for as well as the anno table and the number of items to return. Here, for example, we can see the ways that the term “read” is used in the texts:

dsst_kwic(anno, terms = "read")
## [doc00076] riting style kept me |reading| , and the simple ide
## [doc00793] ast paced & visceral |read   | .                   
## [doc00988]               Having |read   | that Tom Cruise was 
## [doc01427] its really not a bad |read   | , just disappointing
## [doc01904]               Having |read   | the first three book
## [doc02138] er and would like to |read   | more of his works . 
## [doc02340]        Thank you for |reading| this                
## [doc02782]  business book I 've |read   | sinceThe Culture Cod
## [doc02949]       I always enjoy |reading| anything of Agatha C
## [doc03237] eries before i began |reading|
## [doc03535] th information after |reading| the first section , 
## [doc03742]  guess I may as well |read   | the damned thing , n
## [doc04005]  me that I needed to |read   | Aidan Donnelley Rowl
## [doc04551]             You must |read   | the whole set of boo
## [doc04818]        It 's easy to |read   | & filled to the brim
## [doc05262]               I have |read   | the book and recomme
## [doc05409]            Excellent |read   | ! !                 
## [doc05897] ter that I only kept |reading| in hopes that the he
## [doc06473]  was easy and fun to |read   | with her and for me 
## [doc06903] book , while easy to |read   | , just does n't alwa

This will be useful when we start working with datasets where it is not immediately clear why a particular term is being associated with a label type.

Advice Summary

There is no one best way to apply these techniques; you will generally need to experiment and often use multiple models in an analysis. Note that it is not always the most predictive model that is the best to use for your analysis. We are not just (or even primarily) focused on making predictions.

I can offer a bit of general advice, though, as you work through the next set of notes and the first project. I find that first filtering on the five most common parts of speech and fixing the pronoun issue makes a good starting point for a first model. I sometimes also tag with the part of speech code for interpretability, but otherwise leave things as the defaults. This is the model I most often use for negative examples, maximum probability examples, and coefficient tables. Then, depending on the application, I may try other techniques such as n-grams with n=2 or n=3 with n_min=1. If there is a noticeable improvement I will try to understand why and otherwise not bother with it much.
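Putting that advice together, a first-pass model might look like the following sketch, which simply chains together the steps shown earlier in these notes:

```r
# first-pass model: keep the five common parts of speech,
# restore pronouns to their surface forms, and append the
# part of speech tag to each lemma for interpretability
model <- anno %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
  mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
  mutate(lemma = paste0(lemma, "_", upos)) %>%
  dsst_enet_build(docs)
```

From there, swapping in dsst_ngram or dsst_skip_gram before the final step is a quick way to test whether multi-token features add anything.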

Alternative features are useful for other tasks, but not so much for the ones we will see in the next few weeks. We will return to those with Project 2.