The first step in doing predictive text analysis in R is to load some data to work with. Later in the semester you will see how to construct a dataset directly yourself, but until then we will mostly use datasets that I have prepared for you. These datasets will be in a standard format, with two different files.
Today we will look at an example trying to predict the product category that an Amazon user review is associated with. Let's read the two data tables into R and then talk about how they are formatted and how they can be used. In general, we will always use the same variable names for these two tables: docs (documents) and anno (annotations).
docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")
In the docs table, we have one row for each product review. These correspond to the observations that we discussed in the previous notes. In text analysis, we use the term document to describe each observation; you will also hear me call the entire set of documents a corpus.
Let’s take a look at the first few rows of the data:
docs
## # A tibble: 8,823 × 4
## doc_id label train_id text
## <chr> <chr> <chr> <chr>
## 1 doc00001 food valid "At some point I would like to try TeffBob's Red…
## 2 doc00002 book train "John Crowley wrote \"Little, Big\", in the wor…
## 3 doc00003 book valid "The New York Post is often credited with popula…
## 4 doc00004 film valid "INTO THIN AIR, based on Jon Krakauer's best sel…
## 5 doc00005 film train "When the Wind Blows was based upon an English n…
## 6 doc00006 food train "I have sent this basket a number of times to fa…
## 7 doc00007 book train "If you enjoy history, this book is an selection…
## 8 doc00008 film valid "Though it holds up surprisingly well over thirt…
## 9 doc00009 book train "Agatha's written is amazing. The whole story is…
## 10 doc00010 food valid "Pecans are my favorite nut, alas, they are fair…
## # … with 8,813 more rows
We see that the data contains four columns. The first one is called
doc_id
, which contains a unique key that describes each
document. Every docs
table we use will have this variable.
The next column contains the label
of each document. This
is exactly the same as what we called the label in our previous notes.
There is also a column called train_id
that has already
split the data randomly into train and validation sets. This is helpful
so that everyone uses the exact same data for comparison purposes.
Finally, the last column is called text
; it contains the
full text of the review.
Our predictive modelling goal is to predict the label using the text. As we have discussed, we cannot directly fit a model using the text variable as a feature. Instead, we need to produce a set of numeric features that summarize the text. One of the most common methods for doing this is to use features called term frequencies. These are features that count how many times a word or other linguistic element occurs in the text. To do this, we will make use of the second data table.
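To make the idea concrete before we look at the real data, here is a toy sketch in base R (independent of the course files) that counts word occurrences in two tiny "documents":

# Split each toy "document" into words and count how often each word occurs.
toy_docs <- c("the cat sat on the mat", "the dog sat")
lapply(strsplit(toy_docs, " "), table)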
The anno
data table has been automatically created from
the docs
table using a set of predictive models called an
NLP pipeline. This pipeline is not the direct subject of this course,
but in later notes we will see how to apply it and create the
annotations directly. For now, we will just use the ones that I
pre-computed. Here is what the first few rows of the table look
like:
anno
## # A tibble: 1,548,904 × 10
## doc_id sid tid token token_wi…¹ lemma upos xpos tid_s…² relat…³
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 doc00001 1 1 At At at ADP IN 6 prep
## 2 doc00001 1 2 some some some DET DT 3 det
## 3 doc00001 1 3 point point point NOUN NN 1 pobj
## 4 doc00001 1 4 I I -PRO… PRON PRP 6 nsubj
## 5 doc00001 1 5 would would would VERB MD 6 aux
## 6 doc00001 1 6 like like like VERB VB 0 root
## 7 doc00001 1 7 to to to PART TO 8 aux
## 8 doc00001 1 8 try try try VERB VB 6 xcomp
## 9 doc00001 1 9 TeffBob TeffBob Teff… PROPN NNP 15 poss
## 10 doc00001 1 10 's 's 's PART POS 9 case
## # … with 1,548,894 more rows, and abbreviated variable names
## # ¹token_with_ws, ²tid_source, ³relation
We refer to each of the rows as a token; tokens are either words, word parts, or punctuation marks. Notice that if you read the values in the token column down the page, they reconstruct the start of the first document. Because this table has been automatically constructed, the column names in the annotations table are fairly stable across different datasets, with some occasional additions.
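For example, you can rebuild the opening of the first review by pasting its tokens back together. This is a sketch: it assumes, as the name suggests, that the token_with_ws column stores each token together with its trailing whitespace.

# Rebuild the start of doc00001 from its first ten tokens (token_with_ws is
# assumed to include each token's trailing whitespace).
anno %>%
  filter(doc_id == "doc00001") %>%
  slice_head(n = 10) %>%
  pull(token_with_ws) %>%
  paste(collapse = "")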
For now, let’s focus on just four of the columns. The first one
contains the doc_id
that can be used to associate each
token with a document. We also see the token
column that
contains the token itself, which we can count up to create features for
the prediction task. There is also a column called lemma
which contains a standardized version of the token. For example, it
removes start-of-sentence capitalization and puts all verbs into the
infinitive. As the name suggests, this form is called a
lemma. Usually we will use the lemmas rather than the
tokens to construct a model. Finally, we see a column called
upos
, the universal part of speech code associated with the
token. These will be useful in our next set of notes.
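To make these columns concrete, here is a small sketch using standard dplyr verbs that tabulates the most common noun lemmas in the corpus:

# Count the most frequent lemmas tagged as nouns; counts like these become
# the term-frequency features described above.
anno %>%
  filter(upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%
  slice_head(n = 10)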
We now have all of the data we need to construct a predictive model. You could imagine the following manual procedure to construct numeric features:

1. For each unique lemma, create a column in the docs table that counts how often the lemma occurs in that document.
2. Fit a model using only the rows of the docs table that have train_id equal to "train".
3. Evaluate the model on the rows of the docs table that have train_id equal to "valid".

In the next section, I will show you how we can do these steps using low-level R code. You'll see that it's not too difficult but requires a lot of temporary variables and bookkeeping. In the following section, I will show you wrapper functions that make it so you don't need to copy and paste all of this code every time you want to run a model.
Let’s see how we can run an elastic net using the Amazon data we
loaded above using low-level R functions. Note that in the code below I
am using the syntax dataset_name$variable_name
to extract a
specific variable from a specific dataset. This is needed when working
outside of verbs and ggplot commands.
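For instance, we can pull out the label column from docs with this syntax and tabulate it to see how balanced the three categories are:

# Extract a single column with $ and count how often each label occurs.
table(docs$label)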
To start, let’s get a vector of all the unique documents and lemmas
(standardized words) from the data using the unique()
function:
document_set <- unique(anno$doc_id)
vocab_set <- unique(anno$lemma)
Now, I will use the match()
function to create an index
to tell me which document and lemma every row of the anno
data is associated with.
dindex <- as.numeric(match(anno$doc_id, document_set))
tindex <- as.numeric(match(anno$lemma, vocab_set))
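If match() is new to you, here is a toy example, separate from the data above: for each element of its first argument, it returns the position of that element in the second argument.

# match() returns the index of each element of x inside the lookup table.
match(c("b", "a", "b"), c("a", "b", "c"))

## [1] 2 1 2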
Next, we will create a matrix, a rectangular array of numbers. The matrix we will create has one row for each document and one column for each unique term. The numbers count how often each term occurs in a given document. Since most terms do not occur in most documents, this matrix will have a large number of zeros. To account for this, we will create a sparse matrix object that only stores the non-zero elements. Here's the code that creates such an object and displays its dimensions:
X <- Matrix::sparseMatrix(
  i = dindex, j = tindex, x = 1,
  dims = c(length(document_set), length(vocab_set)),
  dimnames = list(document_set, vocab_set)
)
dim(X)
## [1] 8823 53756
We can simplify things by removing any terms that have only a few occurrences. Here, for example, is the code to keep only those terms that occur at least 20 times in the data:
X <- X[, colSums(X) >= 20]
dim(X)
## [1] 8823 4405
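Even after dropping rare terms, the matrix is overwhelmingly sparse. As a quick check, we can compute the fraction of entries that are non-zero with Matrix::nnzero:

# Fraction of non-zero entries; sparse storage pays off because this is tiny.
Matrix::nnzero(X) / prod(dim(X))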
To illustrate where we are, here are the first 12 rows and 12 columns of the data.
X[1:12, 1:12]
## 12 x 12 sparse Matrix of class "dgCMatrix"
## at some point -PRON- would like to try 's Red Mill ,
## doc00001 1 1 2 7 1 2 2 1 1 1 1 3
## doc00002 1 . . 19 . 1 6 . . . . 24
## doc00003 . . 1 9 . 1 6 . . . . 5
## doc00004 . 2 . 12 . . 6 . 2 . . 11
## doc00005 . 1 . 3 . 1 3 . . . . 1
## doc00006 . . . 2 . . 1 . . . . 1
## doc00007 . . . 9 . 1 3 . . . . 5
## doc00008 2 . . 20 . 3 8 . 2 . . 33
## doc00009 . . . 6 . . 1 . 1 . . .
## doc00010 . . . 14 1 . 1 . . . . 12
## doc00011 . . . 13 . 2 4 . 2 . . 6
## doc00012 . . . 4 . . 1 . . . . 4
Now, we need to split the matrix X into the rows in the training set and the rows in the validation set. We also need a pair of vectors, y_train and y_valid, that store the labels we are trying to predict.
X_train <- X[docs$train_id == "train",]
X_valid <- X[docs$train_id == "valid",]
y_train <- docs$label[docs$train_id == "train"]
y_valid <- docs$label[docs$train_id == "valid"]
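As an optional sanity check, we can confirm that the features and responses line up, since both were subset using the same train_id flags:

# Each feature matrix should have exactly one row per response value.
stopifnot(nrow(X_train) == length(y_train), nrow(X_valid) == length(y_valid))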
Now that we have the data, we can run the elastic net model using
the function cv.glmnet
from the glmnet
package. I will set a few options to different values to make the
function run faster.
model <- glmnet::cv.glmnet(
  X_train, y_train,
  family = "multinomial", lambda.min.ratio = 0.1, nfolds = 3
)
Finally, we can use the predict
function to see what the
model would predict for each observation on the validation data. I’ll
use the table
function to show a confusion matrix of the
predictions and the actual responses.
y_valid_pred <- predict(model, newx = X_valid, type = "class")
table(y_valid, y_valid_pred)
## y_valid_pred
## y_valid book film food
## book 1115 58 36
## film 29 1120 29
## food 6 2 1147
Note that the cv.glmnet
function automatically performs
cross-validation and returns results using a lambda chosen by its
cross-validated predictive power on the training data.
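If you want to inspect the cross-validation results yourself, the object returned by cv.glmnet stores them; for example:

# The lambda value minimizing cross-validated error, and the CV curve itself.
model$lambda.min
plot(model)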
In the quick run-through of the commands above, you can see that we can run elastic net models on textual data with a few new commands and several intermediate steps. Understanding this process is important and useful. However, it becomes clunky if you have to do all of those steps every single time you want to fit a model. Copying and pasting code quickly becomes the focus rather than understanding the data and what it is telling us. As a solution to this problem, I have written some wrapper functions that take care of the bookkeeping of running certain models. These are all provided by the dsst package and are prefixed with the string "dsst_".
The main function we will use to build a predictive model is called
dsst_enet_build
. We need to pass the function the
anno
table and the docs
table. It has a number
of options that we can modify (these correspond to the options I
selected above), but for now let’s just use the default values:
model <- dsst_enet_build(anno, docs)
And that’s it! Really. The model object has two elements that contain
the glmnet model object (model$model
) and the document
table augmented with predictions from the cross-validated elastic net
(model$docs
). By default the function has created numeric
features from the 10,000 most frequently used lemmas that are used in at
least 0.1% of the documents. Let’s take a look at some of the ways that
we can evaluate the model. Next class we will look at more ways to
modify the way the model itself is built.
As a starting point, we want to see how well the model has done making predictions on the training and validation data. We can do this using standard data verbs that we learned in 289 applied to the augmented docs data. Here’s what the document table looks like:
model$docs
## # A tibble: 8,823 × 6
## doc_id label train_id text pred_…¹ pred_…²
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 doc00001 food valid "At some point I would like to t… food 0.565
## 2 doc00002 book train "John Crowley wrote \"Little, B… book 0.998
## 3 doc00003 book valid "The New York Post is often cred… book 0.943
## 4 doc00004 film valid "INTO THIN AIR, based on Jon Kra… film 0.928
## 5 doc00005 film train "When the Wind Blows was based u… film 0.682
## 6 doc00006 food train "I have sent this basket a numbe… food 0.492
## 7 doc00007 book train "If you enjoy history, this book… book 0.504
## 8 doc00008 film valid "Though it holds up surprisingly… film 0.996
## 9 doc00009 book train "Agatha's written is amazing. Th… book 0.962
## 10 doc00010 food valid "Pecans are my favorite nut, ala… food 0.843
## # … with 8,813 more rows, and abbreviated variable names ¹pred_label,
## # ²pred_value
And the error rates can be computed by grouping by the train id and summarizing the proportion of labels that do not match their predictions.
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.0598
## 2 valid 0.0703
We see that around 6% of the training data and 7% of the validation data were mis-classified. Not too bad as a first pass! We can learn a bit more by computing error rates segmented by the true label:
model$docs %>%
  group_by(label, train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 6 × 3
## # Groups: label [3]
## label train_id erate
## <chr> <chr> <dbl>
## 1 book train 0.102
## 2 book valid 0.120
## 3 film train 0.0729
## 4 film valid 0.0789
## 5 food train 0.00618
## 6 food valid 0.00952
We see that books were the hardest to classify, with films being the
next hardest, and food being the easiest. We can get a picture of what
kind of errors are being made by using a confusion matrix with the
function table
. This function will print the
cross-tabulated counts of two or more variables. Be careful not to pass
it the whole dataset and note that the order of the variables
matters.
model$docs %>%
  select(label, pred_label, train_id) %>%
  table()
## , , train_id = train
##
## pred_label
## label book film food
## book 1566 81 96
## film 31 1629 97
## food 6 5 1770
##
## , , train_id = valid
##
## pred_label
## label book film food
## book 1064 74 71
## film 21 1085 72
## food 7 4 1144
Note that the actual labels are on the rows and the predicted labels are on the columns. We see that the book errors were fairly evenly split between films and food, while the film errors were mostly confused with food.
Building a predictive model that performs well is a good sign that our model makes sense and that the term frequencies in our model are associated with the labels. However, our real goal is using the model to understand the data. To do this, a key tool will be to look at the model coefficients. We can do this with the dsst_coef function. By default, the values are given based off of the best model from the cross-validation process:
dsst_coef(model$model)
## 82 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.23710135 -0.005983007 0.243084361 .
## book 1.29678087 . . 2
## movie . 1.159127858 . 10
## taste . . 1.343867317 12
## read 1.05941764 . . 13
## film . 0.591468430 . 18
## flavor . . 1.183122329 20
## product . . 0.716630638 35
## watch . 0.553063391 . 37
## dvd . 0.889997431 . 41
## eat . . 0.480897421 45
## story 0.06211210 . -0.336843751 45
## the . . -0.009442920 48
## tasty . . 0.796788934 51
## delicious . . 0.659429355 53
## see . 0.206602875 . 53
## video . 0.479056328 . 54
## price . . 0.469243624 54
## snack . . 0.431084908 54
## cup . . 0.353970904 54
## who . . -0.097558210 54
## of . . -0.045304830 54
## these . . 0.330488315 55
## brand . . 0.544813559 57
## episode . 0.349504443 . 57
## novel 0.29863309 . . 58
## reader 0.55822168 . . 59
## use . . 0.277648163 60
## DVD . 0.448443090 . 63
## acting . 0.536887348 . 64
## performance . 0.308652430 . 64
## coffee . . 0.260610828 64
## classic . 0.402025321 . 65
## write 0.31135438 . . 66
## show . 0.180511772 . 66
## store . . 0.130770384 66
## an . . -0.074406741 66
## 's . . -0.030028168 67
## author 0.35035807 . . 69
## drink . . 0.311248324 69
## Klausner 1.17853114 . . 70
## fresh . . 0.316878880 71
## scene . 0.196899565 . 72
## tasting . . 0.569251636 73
## Amazon . . 0.213702688 73
## healthy . . 0.378844481 74
## package . . 0.328979910 74
## workout . 0.206264884 . 74
## tape . 0.186961381 . 74
## add . . 0.144282343 76
## actor . 0.271256472 . 77
## buy . . 0.122092941 77
## bag . . 0.168075540 78
## flavorful . . 0.377893502 84
## free . . 0.121223106 84
## by . . -0.017012223 84
## order . . 0.128617141 85
## star . 0.076446712 . 87
## flick . 0.185894801 . 88
## mix . . 0.121992066 88
## cook . . 0.073720144 89
## comedy . 0.095615737 . 90
## play . 0.010740561 . 90
## bottle . . 0.103139675 91
## series . . -0.053487185 91
## reading 0.13638857 . . 92
## box . . 0.027353742 92
## cast . 0.044758504 . 93
## mystery 0.07526256 . . 94
## size . . 0.024037498 94
## faithful . 0.280089571 . 95
## soup . . 0.055309128 95
## romance 0.04240444 . . 95
## gluten . . 0.030893505 95
## learn 0.02279078 . . 96
## sweet . . 0.018413373 96
## music . 0.016992794 . 96
## chocolate . . 0.016349201 96
## man . . -0.013443539 97
## fight . 0.009729794 . 97
## cereal . . 0.003794406 98
## tea . . 0.001644503 98
Usually, this gives too many values to easily interpret. Instead, we want to choose a larger value for lambda, which penalizes the model more and yields fewer terms. This requires some experimentation through setting the lambda_num parameter, which controls which lambda in the sequence we use. The allowed values run from 1 (the largest lambda) down to 100 (the smallest lambda). Looking at the 10th value here produces a very small model that is easy to interpret:
dsst_coef(model$model, lambda_num = 10)
## 3 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.08882645 0.036110762 0.05271569 .
## book 0.15217251 . . 2
## movie . 0.006710871 . 10
Increasing to 20 includes more terms and gives a richer understanding of the classes:
dsst_coef(model$model, lambda_num = 20)
## 7 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.1351490 0.04407419 0.091074811 .
## book 0.2707799 . . 2
## movie . 0.16116943 . 10
## taste . . 0.255904477 12
## read 0.1161569 . . 13
## film . 0.02195760 . 18
## flavor . . 0.007651778 20
Usually you will need to look at several different versions of the model to make interesting observations about the data.
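A short loop is one way to scan several model sizes at once; this sketch simply reuses dsst_coef with a few different lambda_num values:

# Print progressively larger models; a smaller lambda_num corresponds to a
# larger lambda and therefore fewer non-zero coefficients.
for (k in c(10, 20, 30)) {
  print(dsst_coef(model$model, lambda_num = k))
}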
One of the most interesting things about working with text data is that we can go back to the model and manually read the text of interesting observations that the model identifies. One place to start is looking at some negative examples: records that are mis-classified by our predictive model. We can grab a subset of these using the filter command along with the function slice_sample. The latter takes a random selection of rows from the data (the results change each time you run it).
model$docs %>%
  filter(label != pred_label) %>%
  filter(train_id == "valid") %>%
  slice_sample(n = 10)
## # A tibble: 10 × 6
## doc_id label train_id text pred_…¹ pred_…²
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 doc05099 film valid "I didn't like this at all, it j… food 0.466
## 2 doc04531 book valid "Terence McKenna used to say tha… film 0.587
## 3 doc01392 book valid "Angelica Shelton Belanov has al… film 0.644
## 4 doc01614 film valid "Anyone reading my reviews on Pe… book 0.996
## 5 doc06233 book valid "I saw the first movie when I wa… film 0.975
## 6 doc07839 book valid "Though some of these are not to… food 0.707
## 7 doc05187 book valid "My very brief review: Well wri… food 0.368
## 8 doc02410 film valid "In Benny Chan's Big Bullet, Lau… food 0.529
## 9 doc01911 film valid "I know that it's is no Grease b… food 0.412
## 10 doc01115 book valid "Great drawings and wonderful cl… food 0.399
## # … with abbreviated variable names ¹pred_label, ²pred_value
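Because slice_sample draws rows at random, your output will differ from the rows shown above. If you want a reproducible sample, set the random seed first:

# Fixing the seed makes slice_sample return the same rows on every run.
set.seed(1)
model$docs %>%
  filter(label != pred_label, train_id == "valid") %>%
  slice_sample(n = 10)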
It can be difficult to read the text in the printout itself. I wrote a helper function called dsst_print_text that prints out the text in an easier-to-read format. We use it at the end of a pipe %>% to show all of the texts. It also displays all of the other metadata at the top of each record.
model$docs %>%
  filter(label != pred_label) %>%
  filter(train_id == "valid") %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc05694; film; valid; food; 0.416980106952054
## Dear weelchair Assassin: There IS a sequel to this, called "Another 48
## Hrs.", & it's EQUALLY GREAT!!!
##
## doc07086; book; valid; film; 0.388108907773896
## this has some beautiful art in it. it also has some ugly art.... if
## thats what you can call it. but it does show you each artist's technic
## if not how they did each peice. so if you like anima and photoshop
## look into it. oh, thats what you are doing? well i gave it 4 stars
## didnt i. maybe three.... but i have had a good day.
##
## doc06872; film; valid; food; 0.476807298660379
## Just realized that I reviewed this product under the Qi gong workout
## which I also purchased. This is an intense workout and if you are
## not in athletic condition you may want to start off with his Qi Gong
## workout and work up to this one. It's a perfect fighting style and
## he's a wonderful master teacher.
##
## doc00071; book; valid; film; 0.395802657628196
## The story starts with the main character, Wicket (Wick for short)
## who is living with her little sister in a foster home after their
## mother's suicide and their sociopathic father's imprisonment. Wick is
## extremely protective of her you ger sister, Lily, and they have a good
## relationship. One day Wick discovers the diary of her former friend,
## who is now dead, with the words "Find Me" scribbled in it.Wick
## is a very competent computer hacker who can use her talents both for
## good and bad. She
##
## doc00848; book; valid; food; 0.701645258487129
## It is hard to imagine less rewarding place to be a police officer
## than notorious Aberdeen, Scotland. Horrendous weather, people out
## of their minds, everything possible goes wrong at every opportunity.
## Of course, in a fictional Aberdeen- I've never been to a real one.
## Stuart MacBride continues his terrific series, filled with picturesque
## characters and heavily peppered with dark, crass humor. Just my taste.
## Apology to the faint hearted or hopelessly prudish.
##
## doc01541; food; valid; book; 0.412296131391372
## If you are a residential user like me and don't happen to read that
## much, you will be surprised when two bags come to your door and
## realize you will never have to buy Garlic Powder ever again.That being
## said, it is a nice quality garlic powder and I definitely recommend
## using this.
##
## doc01472; film; valid; food; 0.391786581309792
## If you like Hollywood musicals you can't beat this collection. Six
## great shows for about five dollars each. And nobody can top Rodgers
## and Hammerstein.
##
## doc00403; book; valid; food; 0.615121491879394
## A must have book for Sondheim fans. However, many reasonably priced
## hardcover copies in excellent condition are available at much lower
## prices ($8-$20 range) than these ridiculous collector prices ($35
## at time of this post) for the softcover here. The 1986 edition in
## hardcover is the same content that is in the softcover with better
## production. So why pay more for less? It covers Pacific Overtures,
## Sweeny Todd, Merrily We Roll Along, Sunday in the Park With George,
## Sondheim Evening, Follies C
##
## doc08128; book; valid; food; 0.401462881585185
## This is a prime example of McManus at his best. It will keep you
## laughing all the way to the end. Would recommend.
##
## doc07973; book; valid; film; 0.365014737295158
## It was a replay, and not a good one, of the horrible case in Fla of
## who killed the Casey Anthony baby. Who abused whom, not good.
At the top of each text is the document id, followed by the real label, the train_id, the predicted label, and the predicted probability of that label. Can you understand why some of these were mis-classified?
On the other end of the spectrum, we can try to understand the model and the data by looking at the texts that have the highest predicted probabilities for their labels. These are often classified correctly, but there are sometimes errors as well. We will look at these in the notebook for this class.
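As a preview, here is one way to pull out such examples, a sketch built from the same verbs used above: sort the validation documents by pred_value and print the most confident ones.

# The validation reviews with the highest predicted probabilities; most are
# classified correctly, but a confident error occasionally appears.
model$docs %>%
  filter(train_id == "valid") %>%
  arrange(desc(pred_value)) %>%
  slice_head(n = 5) %>%
  dsst_print_text()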