The first step in doing predictive text analysis in R is to load some data to work with. Later in the semester you will see how to construct a dataset directly yourself, but until then we will mostly use datasets that I have prepared for you. These datasets will be in a standard format, with two different files.
Today we will look at an example trying to predict the product category that an Amazon user review is associated with. Let's read the two data tables into R and then talk about how they are formatted and how they can be used. In general, we will always use the same variable names for these two tables: docs (documents) and anno (annotations).
docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")
In the docs table, we have one row for each product review. These correspond to the observations that we discussed in the previous notes. In text analysis, we use the term document to describe each observation; you will also hear me call the entire set of documents a corpus.
Let’s take a look at the first few rows of the data:
docs
## # A tibble: 8,823 × 4
## doc_id label train_id text
## <chr> <chr> <chr> <chr>
## 1 doc00001 food valid "At some point I would like to try TeffBob's Red…
## 2 doc00002 book train "John Crowley wrote \"Little, Big\", in the wor…
## 3 doc00003 book valid "The New York Post is often credited with popula…
## 4 doc00004 film valid "INTO THIN AIR, based on Jon Krakauer's best sel…
## 5 doc00005 film train "When the Wind Blows was based upon an English n…
## 6 doc00006 food train "I have sent this basket a number of times to fa…
## 7 doc00007 book train "If you enjoy history, this book is an selection…
## 8 doc00008 film valid "Though it holds up surprisingly well over thirt…
## 9 doc00009 book train "Agatha's written is amazing. The whole story is…
## 10 doc00010 food valid "Pecans are my favorite nut, alas, they are fair…
## # … with 8,813 more rows
We see that the data contains four columns. The first one is called
doc_id
, which contains a unique key that describes each
document. Every docs
table we use will have this variable.
The next column contains the label
of each document. This
is exactly the same as what we called the label in our previous notes.
There is also a column called train_id
that has already
split the data randomly into train and validation sets. This is helpful
so that everyone uses the exact same data for comparison purposes.
Finally, the last column is called text
; it contains the
full text of the review.
Our predictive modelling goal is to predict the label using the text. As we have discussed, we cannot directly fit a model using the text variable as a feature. Instead, we need to produce a set of numeric features that summarize the text. One of the most common methods for doing this is to use features called term frequencies. These are features that count how many times a word or other linguistic element occurs in the text. To do this, we will make use of the second data table.
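To make the idea concrete before we look at the real data, here is a toy sketch in base R (independent of the course files) that counts word occurrences in two tiny "documents":

# Split each toy "document" into words and count how often each word occurs.
toy_docs <- c("the cat sat on the mat", "the dog sat")
lapply(strsplit(toy_docs, " "), table)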
The anno
data table has been automatically created from
the docs
table using a set of predictive models called an
NLP pipeline. This pipeline is not the direct subject of this course,
but in later notes we will see how to apply it and create the
annotations directly. For now, we will just use the ones that I
pre-computed. Here is what the first few rows of the table look
like:
anno
## # A tibble: 1,548,904 × 10
## doc_id sid tid token token_wi…¹ lemma upos xpos tid_s…² relat…³
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 doc00001 1 1 At At at ADP IN 6 prep
## 2 doc00001 1 2 some some some DET DT 3 det
## 3 doc00001 1 3 point point point NOUN NN 1 pobj
## 4 doc00001 1 4 I I -PRO… PRON PRP 6 nsubj
## 5 doc00001 1 5 would would would VERB MD 6 aux
## 6 doc00001 1 6 like like like VERB VB 0 root
## 7 doc00001 1 7 to to to PART TO 8 aux
## 8 doc00001 1 8 try try try VERB VB 6 xcomp
## 9 doc00001 1 9 TeffBob TeffBob Teff… PROPN NNP 15 poss
## 10 doc00001 1 10 's 's 's PART POS 9 case
## # … with 1,548,894 more rows, and abbreviated variable names
## # ¹token_with_ws, ²tid_source, ³relation
We refer to each of the rows as a token; tokens are either words, word parts, or punctuation marks. Notice that if you read the values in the token column down the page, they reconstruct the start of the first document. Because this table has been automatically constructed, the column names in the annotations table are fairly stable across different datasets, with some occasional additions.
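For example, you can rebuild the opening of the first review by pasting its tokens back together. This is a sketch: it assumes, as the name suggests, that the token_with_ws column stores each token together with its trailing whitespace.

# Rebuild the start of doc00001 from its first ten tokens (token_with_ws is
# assumed to include each token's trailing whitespace).
anno %>%
  filter(doc_id == "doc00001") %>%
  slice_head(n = 10) %>%
  pull(token_with_ws) %>%
  paste(collapse = "")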
For now, let’s focus on just four of the columns. The first one
contains the doc_id
that can be used to associate each
token with a document. We also see the token
column that
contains the token itself, which we can count up to create features for
the prediction task. There is also a column called lemma
which contains a standardized version of the token. For example, it
removes start-of-sentence capitalization and puts all verbs into the
infinitive. As the name suggests, this form is called a
lemma. Usually we will use the lemmas rather than the
tokens to construct a model. Finally, we see a column called
upos
, the universal part of speech code associated with the
token. These will be useful in our next set of notes.
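To make these columns concrete, here is a small sketch using standard dplyr verbs that tabulates the most common noun lemmas in the corpus:

# Count the most frequent lemmas tagged as nouns; counts like these become
# the term-frequency features described above.
anno %>%
  filter(upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%
  slice_head(n = 10)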
We now have all of the data we need to construct a predictive model. You could imagine the following manual procedure to construct numeric features:

1. For each unique lemma, create a column in the docs table that counts how often the lemma occurs in that document.
2. Fit a model using only the rows of the docs table that have train_id equal to "train".
3. Evaluate the model on the rows of the docs table that have train_id equal to "valid".

In the next section, I will show you how we can do these steps using low-level R code. You'll see that it's not too difficult but requires a lot of temporary variables and bookkeeping. In the following section, I will show you wrapper functions that make it so you don't need to copy and paste all of this code every time you want to run a model.
Let’s see how we can run an elastic net using the Amazon data we
loaded above using low-level R functions. Note that in the code below I
am using the syntax dataset_name$variable_name
to extract a
specific variable from a specific dataset. This is needed when working
outside of verbs and ggplot commands.
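For instance, we can pull out the label column from docs with this syntax and tabulate it to see how balanced the three categories are:

# Extract a single column with $ and count how often each label occurs.
table(docs$label)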
To start, let’s get a vector of all the unique documents and lemmas
(standardized words) from the data using the unique()
function:
document_set <- unique(anno$doc_id)
vocab_set <- unique(anno$lemma)
Now, I will use the match()
function to create an index
to tell me which document and lemma every row of the anno
data is associated with.
dindex <- as.numeric(match(anno$doc_id, document_set))
tindex <- as.numeric(match(anno$lemma, vocab_set))
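If match() is new to you, here is a toy example, separate from the data above: for each element of its first argument, it returns the position of that element in the second argument.

# match() returns the index of each element of x inside the lookup table.
match(c("b", "a", "b"), c("a", "b", "c"))

## [1] 2 1 2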
Next, we will create a matrix, a rectangular array of numbers. The matrix we will create has one row for each document and one column for each unique term. The numbers count how often each term occurs in a given document. Since most terms do not occur in most documents, this matrix will have a large number of zeros. To account for this, we will create a sparse matrix object that only stores the non-zero elements. Here's the code that creates such an object and displays its dimensions:
X <- Matrix::sparseMatrix(
  i = dindex, j = tindex, x = 1,
  dims = c(length(document_set), length(vocab_set)),
  dimnames = list(document_set, vocab_set)
)
dim(X)
## [1] 8823 53756
We can simplify things by removing any terms that have only a few occurrences. Here, for example, is the code to keep only those terms that occur at least 20 times in the data:
X <- X[, colSums(X) >= 20]
dim(X)
## [1] 8823 4405
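Even after dropping rare terms, the matrix is overwhelmingly sparse. As a quick check, we can compute the fraction of entries that are non-zero with Matrix::nnzero:

# Fraction of non-zero entries; sparse storage pays off because this is tiny.
Matrix::nnzero(X) / prod(dim(X))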
To illustrate where we are, here are the first 12 rows and 12 columns of the data.
X[1:12, 1:12]
## 12 x 12 sparse Matrix of class "dgCMatrix"
## at some point -PRON- would like to try 's Red Mill ,
## doc00001 1 1 2 7 1 2 2 1 1 1 1 3
## doc00002 1 . . 19 . 1 6 . . . . 24
## doc00003 . . 1 9 . 1 6 . . . . 5
## doc00004 . 2 . 12 . . 6 . 2 . . 11
## doc00005 . 1 . 3 . 1 3 . . . . 1
## doc00006 . . . 2 . . 1 . . . . 1
## doc00007 . . . 9 . 1 3 . . . . 5
## doc00008 2 . . 20 . 3 8 . 2 . . 33
## doc00009 . . . 6 . . 1 . 1 . . .
## doc00010 . . . 14 1 . 1 . . . . 12
## doc00011 . . . 13 . 2 4 . 2 . . 6
## doc00012 . . . 4 . . 1 . . . . 4
Now, we need to split the matrix X into the rows in the training set and the rows in the validation set. We also need a pair of vectors, y_train and y_valid, that store the labels we are trying to predict.
X_train <- X[docs$train_id == "train",]
X_valid <- X[docs$train_id == "valid",]
y_train <- docs$label[docs$train_id == "train"]
y_valid <- docs$label[docs$train_id == "valid"]
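As an optional sanity check, we can confirm that the features and responses line up, since both were subset using the same train_id flags:

# Each feature matrix should have exactly one row per response value.
stopifnot(nrow(X_train) == length(y_train), nrow(X_valid) == length(y_valid))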
Now that we have the data, we can run the elastic net model using
the function cv.glmnet
from the glmnet
package. I will set a few options to different values to make the
function run faster.
model <- glmnet::cv.glmnet(
  X_train, y_train,
  family = "multinomial", lambda.min.ratio = 0.1, nfolds = 3
)
Finally, we can use the predict
function to see what the
model would predict for each observation on the validation data. I’ll
use the table
function to show a confusion matrix of the
predictions and the actual responses.
y_valid_pred <- predict(model, newx = X_valid, type = "class")
table(y_valid, y_valid_pred)
## y_valid_pred
## y_valid book film food
## book 1115 58 36
## film 29 1120 29
## food 6 2 1147
Note that the cv.glmnet
function automatically performs
cross-validation and returns results using a lambda chosen by its
cross-validated predictive power on the training data.
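If you want to inspect the cross-validation results yourself, the object returned by cv.glmnet stores them; for example:

# The lambda value minimizing cross-validated error, and the CV curve itself.
model$lambda.min
plot(model)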
In the quick run-through of the commands above, you can see that we can run elastic net models on textual data with a few new commands and several intermediate steps. Understanding this process is important and useful. However, it becomes clunky if you have to do all of those steps every single time you want to fit a model. Copying and pasting code quickly becomes the focus rather than understanding the data and what it is telling us. As a solution to this problem, I have written some wrapper functions that take care of the bookkeeping of running certain models. These are all provided by the dsst package and are prefixed with the string "dsst_".
The main function we will use to build a predictive model is called
dsst_enet_build
. We need to pass the function the
anno
table and the docs
table. It has a number
of options that we can modify (these correspond to the options I
selected above), but for now let’s just use the default values:
model <- dsst_enet_build(anno, docs)
And that’s it! Really. The model object has two elements that contain
the glmnet model object (model$model
) and the document
table augmented with predictions from the cross-validated elastic net
(model$docs
). By default the function has created numeric
features from the 10,000 most frequently used lemmas that are used in at
least 0.1% of the documents. Let’s take a look at some of the ways that
we can evaluate the model. Next class we will look at more ways to
modify the way the model itself is built.
As a starting point, we want to see how well the model has done making predictions on the training and validation data. We can do this using standard data verbs that we learned in 289 applied to the augmented docs data. Here’s what the document table looks like:
model$docs
## # A tibble: 8,823 × 6
## doc_id label train_id text pred_…¹ pred_…²
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 doc00001 food valid "At some point I would like to t… food 0.565
## 2 doc00002 book train "John Crowley wrote \"Little, B… book 0.998
## 3 doc00003 book valid "The New York Post is often cred… book 0.943
## 4 doc00004 film valid "INTO THIN AIR, based on Jon Kra… film 0.928
## 5 doc00005 film train "When the Wind Blows was based u… film 0.682
## 6 doc00006 food train "I have sent this basket a numbe… food 0.492
## 7 doc00007 book train "If you enjoy history, this book… book 0.504
## 8 doc00008 film valid "Though it holds up surprisingly… film 0.996
## 9 doc00009 book train "Agatha's written is amazing. Th… book 0.962
## 10 doc00010 food valid "Pecans are my favorite nut, ala… food 0.843
## # … with 8,813 more rows, and abbreviated variable names ¹pred_label,
## # ²pred_value
And the error rates can be computed by grouping by the train id and summarizing the proportion of labels that do not match their predictions.
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.0598
## 2 valid 0.0703
We see that around 6% of the training data and 7% of the validation data were mis-classified. Not too bad as a first pass! We can learn a bit more by computing error rates segmented by the true label:
model$docs %>%
  group_by(label, train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 6 × 3
## # Groups: label [3]
## label train_id erate
## <chr> <chr> <dbl>
## 1 book train 0.102
## 2 book valid 0.120
## 3 film train 0.0729
## 4 film valid 0.0789
## 5 food train 0.00618
## 6 food valid 0.00952
We see that books were the hardest to classify, with films being the
next hardest, and food being the easiest. We can get a picture of what
kind of errors are being made by using a confusion matrix with the
function table
. This function will print the
cross-tabulated counts of two or more variables. Be careful not to pass
it the whole dataset and note that the order of the variables
matters.
model$docs %>%
  select(label, pred_label, train_id) %>%
  table()
## , , train_id = train
##
## pred_label
## label book film food
## book 1566 81 96
## film 31 1629 97
## food 6 5 1770
##
## , , train_id = valid
##
## pred_label
## label book film food
## book 1064 74 71
## film 21 1085 72
## food 7 4 1144
Note that the actual labels are on the rows and the predicted labels are on the columns. We see that the book errors were fairly evenly split between films and food, while the film errors were mostly confused with food.
Building a predictive model that performs well is a good sign that our model makes sense and that the term frequencies in our model are associated with the labels. However, our real goal is using the model to understand the data. To do this, a key tool will be to look at the model coefficients. We can do this with the dsst_coef function. By default, the values are given based off of the best model from the cross-validation process:
dsst_coef(model$model)
## 82 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.23710135 -0.005983007 0.243084361 .
## book 1.29678087 . . 2
## movie . 1.159127858 . 10
## taste . . 1.343867317 12
## read 1.05941764 . . 13
## film . 0.591468430 . 18
## flavor . . 1.183122329 20
## product . . 0.716630638 35
## watch . 0.553063391 . 37
## dvd . 0.889997431 . 41
## eat . . 0.480897421 45
## story 0.06211210 . -0.336843751 45
## the . . -0.009442920 48
## tasty . . 0.796788934 51
## delicious . . 0.659429355 53
## see . 0.206602875 . 53
## video . 0.479056328 . 54
## price . . 0.469243624 54
## snack . . 0.431084908 54
## cup . . 0.353970904 54
## who . . -0.097558210 54
## of . . -0.045304830 54
## these . . 0.330488315 55
## brand . . 0.544813559 57
## episode . 0.349504443 . 57
## novel 0.29863309 . . 58
## reader 0.55822168 . . 59
## use . . 0.277648163 60
## DVD . 0.448443090 . 63
## acting . 0.536887348 . 64
## performance . 0.308652430 . 64
## coffee . . 0.260610828 64
## classic . 0.402025321 . 65
## write 0.31135438 . . 66
## show . 0.180511772 . 66
## store . . 0.130770384 66
## an . . -0.074406741 66
## 's . . -0.030028168 67
## author 0.35035807 . . 69
## drink . . 0.311248324 69
## Klausner 1.17853114 . . 70
## fresh . . 0.316878880 71
## scene . 0.196899565 . 72
## tasting . . 0.569251636 73
## Amazon . . 0.213702688 73
## healthy . . 0.378844481 74
## package . . 0.328979910 74
## workout . 0.206264884 . 74
## tape . 0.186961381 . 74
## add . . 0.144282343 76
## actor . 0.271256472 . 77
## buy . . 0.122092941 77
## bag . . 0.168075540 78
## flavorful . . 0.377893502 84
## free . . 0.121223106 84
## by . . -0.017012223 84
## order . . 0.128617141 85
## star . 0.076446712 . 87
## flick . 0.185894801 . 88
## mix . . 0.121992066 88
## cook . . 0.073720144 89
## comedy . 0.095615737 . 90
## play . 0.010740561 . 90
## bottle . . 0.103139675 91
## series . . -0.053487185 91
## reading 0.13638857 . . 92
## box . . 0.027353742 92
## cast . 0.044758504 . 93
## mystery 0.07526256 . . 94
## size . . 0.024037498 94
## faithful . 0.280089571 . 95
## soup . . 0.055309128 95
## romance 0.04240444 . . 95
## gluten . . 0.030893505 95
## learn 0.02279078 . . 96
## sweet . . 0.018413373 96
## music . 0.016992794 . 96
## chocolate . . 0.016349201 96
## man . . -0.013443539 97
## fight . 0.009729794 . 97
## cereal . . 0.003794406 98
## tea . . 0.001644503 98
Usually, this gives too many values to easily interpret. Instead, we want to choose a larger value for lambda, which penalizes the model more and yields fewer terms. This requires some experimentation through setting the lambda_num parameter, which controls which lambda in the sequence we use. The allowed values run from 1 (the largest lambda) down to 100 (the smallest lambda). Looking at the 10th value here produces a very small model that is easy to interpret:
dsst_coef(model$model, lambda_num = 10)
## 3 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.08882645 0.036110762 0.05271569 .
## book 0.15217251 . . 2
## movie . 0.006710871 . 10
Increasing to 20 includes more terms and gives a richer understanding of the classes:
dsst_coef(model$model, lambda_num = 20)
## 7 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.1351490 0.04407419 0.091074811 .
## book 0.2707799 . . 2
## movie . 0.16116943 . 10
## taste . . 0.255904477 12
## read 0.1161569 . . 13
## film . 0.02195760 . 18
## flavor . . 0.007651778 20
Usually you will need to look at several different versions of the model to make interesting observations about the data.
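A short loop is one way to scan several model sizes at once; this sketch simply reuses dsst_coef with a few different lambda_num values:

# Print progressively larger models; a smaller lambda_num corresponds to a
# larger lambda and therefore fewer non-zero coefficients.
for (k in c(10, 20, 30)) {
  print(dsst_coef(model$model, lambda_num = k))
}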
One of the most interesting things about working with text data is that we can go back to the model and manually read the text of interesting observations that the model identifies. One place to start is looking at some negative examples: records that are mis-classified by our predictive model. We can grab a subset of these using the filter command along with the function slice_sample. The latter takes a random selection of rows from the data (the results change each time you run it).
model$docs %>%
  filter(label != pred_label) %>%
  filter(train_id == "valid") %>%
  slice_sample(n = 10)
## # A tibble: 10 × 6
## doc_id label train_id text pred_…¹ pred_…²
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 doc05099 film valid "I didn't like this at all, it j… food 0.466
## 2 doc04531 book valid "Terence McKenna used to say tha… film 0.587
## 3 doc01392 book valid "Angelica Shelton Belanov has al… film 0.644
## 4 doc01614 film valid "Anyone reading my reviews on Pe… book 0.996
## 5 doc06233 book valid "I saw the first movie when I wa… film 0.975
## 6 doc07839 book valid "Though some of these are not to… food 0.707
## 7 doc05187 book valid "My very brief review: Well wri… food 0.368
## 8 doc02410 film valid "In Benny Chan's Big Bullet, Lau… food 0.529
## 9 doc01911 film valid "I know that it's is no Grease b… food 0.412
## 10 doc01115 book valid "Great drawings and wonderful cl… food 0.399
## # … with abbreviated variable names ¹pred_label, ²pred_value
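Because slice_sample draws rows at random, your output will differ from the rows shown above. If you want a reproducible sample, set the random seed first:

# Fixing the seed makes slice_sample return the same rows on every run.
set.seed(1)
model$docs %>%
  filter(label != pred_label, train_id == "valid") %>%
  slice_sample(n = 10)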
It can be difficult to read the text in the printout itself. I wrote a helper function called dsst_print_text that prints out the text in an easier-to-read format. We use it at the end of a pipe %>% to show all of the texts. It also displays all of the other metadata at the top of each record.
model$docs %>%
  filter(label != pred_label) %>%
  filter(train_id == "valid") %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc05694; film; valid; food; 0.416980106952054
## Dear weelchair Assassin: There IS a sequel to this, called "Another 48
## Hrs.", & it's EQUALLY GREAT!!!
##
## doc07086; book; valid; film; 0.388108907773896
## this has some beautiful art in it. it also has some ugly art.... if
## thats what you can call it. but it does show you each artist's technic
## if not how they did each peice. so if you like anima and photoshop
## look into it. oh, thats what you are doing? well i gave it 4 stars
## didnt i. maybe three.... but i have had a good day.
##
## doc06872; film; valid; food; 0.476807298660379
## Just realized that I reviewed this product under the Qi gong workout
## which I also purchased. This is an intense workout and if you are
## not in athletic condition you may want to start off with his Qi Gong
## workout and work up to this one. It's a perfect fighting style and
## he's a wonderful master teacher.
##
## doc00071; book; valid; film; 0.395802657628196
## The story starts with the main character, Wicket (Wick for short)
## who is living with her little sister in a foster home after their
## mother's suicide and their sociopathic father's imprisonment. Wick is
## extremely protective of her you ger sister, Lily, and they have a good
## relationship. One day Wick discovers the diary of her former friend,
## who is now dead, with the words "Find Me" scribbled in it.Wick
## is a very competent computer hacker who can use her talents both for
## good and bad. She
##
## doc00848; book; valid; food; 0.701645258487129
## It is hard to imagine less rewarding place to be a police officer
## than notorious Aberdeen, Scotland. Horrendous weather, people out
## of their minds, everything possible goes wrong at every opportunity.
## Of course, in a fictional Aberdeen- I've never been to a real one.
## Stuart MacBride continues his terrific series, filled with picturesque
## characters and heavily peppered with dark, crass humor. Just my taste.
## Apology to the faint hearted or hopelessly prudish.
##
## doc01541; food; valid; book; 0.412296131391372
## If you are a residential user like me and don't happen to read that
## much, you will be surprised when two bags come to your door and
## realize you will never have to buy Garlic Powder ever again.That being
## said, it is a nice quality garlic powder and I definitely recommend
## using this.
##
## doc01472; film; valid; food; 0.391786581309792
## If you like Hollywood musicals you can't beat this collection. Six
## great shows for about five dollars each. And nobody can top Rodgers
## and Hammerstein.
##
## doc00403; book; valid; food; 0.615121491879394
## A must have book for Sondheim fans. However, many reasonably priced
## hardcover copies in excellent condition are available at much lower
## prices ($8-$20 range) than these ridiculous collector prices ($35
## at time of this post) for the softcover here. The 1986 edition in
## hardcover is the same content that is in the softcover with better
## production. So why pay more for less? It covers Pacific Overtures,
## Sweeny Todd, Merrily We Roll Along, Sunday in the Park With George,
## Sondheim Evening, Follies C
##
## doc08128; book; valid; food; 0.401462881585185
## This is a prime example of McManus at his best. It will keep you
## laughing all the way to the end. Would recommend.
##
## doc07973; book; valid; film; 0.365014737295158
## It was a replay, and not a good one, of the horrible case in Fla of
## who killed the Casey Anthony baby. Who abused whom, not good.
At the top of each text is the document id, followed by the real label, the train_id, the predicted label, and the predicted probability of that label. Can you understand why some of these were mis-classified?
On the other end of the spectrum, we can try to understand the model and the data by looking at the texts that have the highest predicted probabilities for their labels. These are often classified correctly, but there are sometimes errors as well. We will look at these in the notebook for this class.
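As a preview, here is one way to pull out such examples, a sketch built from the same verbs used above: sort the validation documents by pred_value and print the most confident ones.

# The validation reviews with the highest predicted probabilities; most are
# classified correctly, but a confident error occasionally appears.
model$docs %>%
  filter(train_id == "valid") %>%
  arrange(desc(pred_value)) %>%
  slice_head(n = 5) %>%
  dsst_print_text()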