library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(cleanNLP)
library(glmnet)
library(magrittr)
library(stringi)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

A Text Analysis Pipeline

This notebook is written specifically to provide an easy reference for finding the code that you will need for the next two projects. My hope is that this will help your group focus on the results of the analysis rather than on hunting down code, and also make it easy to see the various ways that you can play with the models and results as you explore your data. We will work today with the Amazon product review data because it is relatively simple and the results are easy to interpret, but it is complex enough to show many of the most important features of a text analysis pipeline using the tools we have developed so far.

Note that the parts here are a process, not a checklist. You will likely need to run and investigate several models, tweaking them along the way, as you explore a corpus of text. The part names are simply to help you distinguish what each bit of code is doing.

Part 1: Read in the Data

Our first step will be reading in the data and the pre-parsed tokens. If not already provided, we will also create training and validation tags. Note that usually this block of code will be given to you.

set.seed(1)

amazon <- read_csv("data/amazon_product_class.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
token <- read_csv("data/amazon_product_class_token.csv.gz")

Note that in all of the code below you will need to change the name of the data set, here amazon, to the name of your new data set, and the name of the response variable, here category, to the name of the response variable in the data you are working with. Other names should be consistent if you use the data I have created.
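
If you are working with a new data set, it can also help to quickly verify the response variable and the train/validation split before going further. A minimal check, assuming the column names used here (category and train_id):

amazon %>%
  count(category, train_id)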

Part 2: Create a Term Frequency (TF) Matrix

The next step is to create a term frequency matrix, the model matrix we will use to find those features that are most strongly associated with each category in the data. This uses the function cnlp_utils_tf, which has a number of options we can change (some are listed below). We can also add different types of pre-processing and filtering before passing the tokens data to the cnlp_utils_tf function. There are several different ways of creating the TF matrix, which we will separate into four sub-parts.

Part 2a: Raw Frequencies

This is the default approach, where we simply count how frequently lemmas (word forms) occur in the text.

X <- token %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

You can adjust min_df, max_df, and max_features to change which features are included in the model. The doc_var and token_var arguments give the names of the variables used to define the documents and the features. You probably will not need to change these right now, but we will later.
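
As an illustration, here is a sketch of a more restrictive matrix that drops rarer lemmas and caps the vocabulary; the specific cutoffs are only examples, not recommendations:

X_small <- token %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.01,        # keep only lemmas in at least 1% of documents
    max_df = 0.9,         # drop lemmas in more than 90% of documents
    max_features = 2000,  # keep at most 2000 lemmas
    doc_var = "doc_id",
    token_var = "lemma"
  )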

Part 2b: N-Grams (instead of 2a)

Another option is to count sequences of words using N-grams. A 2-gram counts pairs of consecutive words, a 3-gram triples, and so forth. To create N-grams we use the function sm_ngram, to which we can provide the maximum length of the N-grams (n) as well as the minimum length (n_min).

X <- token %>%
  sm_ngram(n = 2, n_min = 1, doc_var = "doc_id", token_var = "lemma") %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "token"
  )

Note that setting these values too large will create a huge model matrix that may be difficult for your machine to process.
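
One quick way to see whether the matrix has become unreasonably large before fitting anything is to check its dimensions:

dim(X)   # number of documents and number of features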

Part 2c: Skip Grams (instead of 2a)

A skip-gram is similar to an N-gram, but it counts combinations of words that appear near one another, with the possibility of some terms falling between them. For example, skip-grams with N equal to 2 and a skip of 1 consist of standard 2-grams as well as pairs of words separated by a third word. Here is the code for 2-skip-grams with a skip (k) of 1:

X <- token %>%
  sm_skip_ngram(n = 2, n_min = 1, k = 1, doc_var = "doc_id", token_var = "lemma") %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "token"
  )

Even more than with N-grams, be careful not to create a model matrix that is too large.

Part 2d: Adding Covariates (optional; use along with 2a, 2b, or 2c)

Another way of modifying the model matrix is to include a small set of additional features in the matrix along with the word counts. Typically these come from additional metadata in our corpus, but they can also come from hand-constructed features of the texts. For example, let’s create a model matrix of covariates using the counts of capital letters and numbers:

X_cov <- amazon %>%
  mutate(cnt_caps = stri_count(text, regex = "[A-Z]")) %>%
  mutate(cnt_nums = stri_count(text, regex = "[0-9]")) %>%
  model.frame(category ~ cnt_caps + cnt_nums -1, data = .) %>%
  model.matrix(attr(., "terms"), .)

Then, we combine this matrix together with the TF matrix using the function cbind:

X <- cbind(X_cov, X)

Once we have this matrix (which should still be sparse), we can fit the model as usual. This will be most useful in later projects where we have additional covariates that you may want to include.
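
A quick way to confirm that combining the covariates with the term frequencies did not accidentally produce a dense matrix is to check the class of the result:

class(X)   # should still be a sparse Matrix class, such as "dgCMatrix"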

Part 3: Build a Penalized Regression Model

Now that we have a term frequency matrix, we can fit a regression model. The first two lines create the training data (just modify the data and response names; nothing else to change here). Next, the cv.glmnet function has a number of options. We have talked about most of these. The options lambda.min.ratio and nlambda can be used to make the model run faster; I usually find the best control is to play with the first rather than the second: setting it a bit higher (0.02 or 0.05) will run faster.

X_train <- X[amazon$train_id == "train", ]
y_train <- amazon$category[amazon$train_id == "train"]

model <- cv.glmnet(
  X_train,
  y_train,
  alpha = 0.9,
  family = "multinomial",
  nfolds = 3,
  trace.it = TRUE,
  relax = FALSE,
  lambda.min.ratio = 0.01,
  nlambda = 100
)

The code above sets trace.it = TRUE, which I recommend in your own interactive work; in the posted notebooks and solutions I will set it to FALSE because the progress bar does not print well when creating the HTML documents I post on the website.

Part 4: Evaluate the Model Fit

Next, we want to look at the model itself. This is the first part where we can start understanding the data we are looking at.

Part 4a: Classification Rate

We can start by seeing how well the model predicts the responses. We can do this with the following code:

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.978
## 2 valid         0.953

This gives the classification rate for the training and validation sets, using the cross-validated value of lambda.
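
By default, predict() on a cv.glmnet object uses the value of lambda selected by cross-validation ("lambda.1se"). If you want to compare against a different value, you can pass it through the s argument; for example, a sketch using "lambda.min":

amazon %>%
  mutate(pred = as.vector(
    predict(model, newx = X, type = "class", s = "lambda.min")
  )) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))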

Part 4b: Segmented Error Rates

Another approach we can use is to look at how well the model predicts each class. This will be useful when looking at a large number of categories. Remember that you will need to change the variable category to the variable name in your data containing the response variable.

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  group_by(category, train_id) %>%
  summarize(class_rate = mean(category == pred)) %>%
  pivot_wider(names_from = "train_id", values_from = "class_rate") %>%
  arrange(valid)
## # A tibble: 3 x 3
## # Groups:   category [3]
##   category train valid
##   <chr>    <dbl> <dbl>
## 1 book     0.963 0.929
## 2 film     0.975 0.938
## 3 food     0.997 0.991

The code above sorts the categories so that the lowest validation classification rate comes first. You can remove the arrange() function to show the categories in their default order.

Part 4c: Confusion Matrix

Finally, we can also look at a full confusion matrix. This helps figure out which categories are being mistaken for other categories.

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  filter(train_id == "valid") %>%
  select(category, pred) %>%
  table()
##         pred
## category book film food
##     book 1091   47   37
##     film   44 1077   27
##     food    4    7 1171

Note that I have changed this a bit from the earlier notes to only show the confusion matrix for the validation set (usually what we want anyway).

Part 5: Investigate Coefficients

Now, perhaps the most important step: looking at the model coefficients.

temp <- coef(model, s = model$lambda[15])  # coefficients at the 15th lambda value
beta <- Reduce(cbind, temp)                # combine the per-category coefficients into one matrix
beta <- beta[apply(beta != 0, 1, any),]    # keep only terms that are non-zero for some category
colnames(beta) <- names(temp)
beta
## 7 x 3 sparse Matrix of class "dgCMatrix"
##                   book       film       food
## (Intercept) -0.1304770 0.05186200 0.07861499
## book         0.3005243 .          .         
## read         0.1589439 .          .         
## movie        .         0.19598063 .         
## film         .         0.03793723 .         
## taste        .         .          0.30511466
## flavor       .         .          0.06587257

You should adjust the lambda index (here, 15) to select just enough non-zero terms to understand which variables are the most important.
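
To help pick the index, you can look at how many terms are selected across the whole sequence of lambda values; the glmnet fit stored inside the cross-validation object records this in its df component. A small sketch:

tibble(
  index  = seq_along(model$lambda),
  lambda = model$lambda,
  nterms = model$glmnet.fit$df   # terms with a non-zero coefficient for any category
)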

Part 6: Exploring the Coefficients with Keywords in Context (KWiC)

To better understand how words are being used in the text, we can use the function sm_kwic. Here is the function with all of the available options:

sm_kwic("play", amazon$text, n = 15, ignore_case = TRUE, width = 30L)
##  [1] "turing the actress would later| play |Miss Brahms in the old \"Are Yo"
##  [2] " Hokey teams are tied and they| play |a game to untie the game? That" 
##  [3] "a lot more barbaric. The sword| play |is slower and more brutal. You" 
##  [4] "lls (I admit that watching her| play |is completely mesmerizing).  I" 
##  [5] "  Over-the-top then comes into| play |in the  climatic scene where E" 
##  [6] "by Robert Altman, based on the| play |by David Rabe, is an example o" 
##  [7] "nces begin to get involved and| play |a critical part  in the events" 
##  [8] "que Parent and Jennifer Burton| play |girlfriends  who get off on vo" 
##  [9] "ased wheels.Bunny didn't quite| play |the leading role I expected an" 
## [10] "been a highly successful stage| play |starring Leslie Howard and Hum" 
## [11] "entity and how do other people| play |a role in our construction of " 
## [12] " have the kid kill Keaton.In a| play |Keaton and the neighbourhood k" 
## [13] " a great line up of actors who| play |their characters really really" 
## [14] "nd jogged forward and began to| play |again. This was the 2D version" 
## [15] "nks using Havana Gooding, Jr. |(play-|ing a deaf mute with a slade t"

This selects 15 examples of the word “play” (ignoring case) and prints out 30 characters of context to the left and right of the word.
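
It often helps to run this for the terms that show up in the coefficient table. For example, to see how “taste” is used:

sm_kwic("taste", amazon$text, n = 10, ignore_case = TRUE, width = 30L)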

Part 7: Exploring the Model Fit

Part 7a: Negative Examples

One way to understand how the model is working is to look at negative examples: texts that the model does not classify correctly. This code shows 10 random validation examples that are misclassified.

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  filter(train_id == "valid") %>%
  filter(pred != category) %>%
  sample_n(size = 10) %>%
  mutate(text = stri_sub(text, 1, 500)) %>%
  mutate(response = sprintf("%s => %s \n %s\n", category, pred, text)) %>%
  use_series(response) %>%
  stri_split(fixed = "\n") %>%
  unlist() %>%
  stri_wrap(width = 79) %>%
  cat(sep = "\n")
## film => food
## i got my order quickly and iam satistifed.... no issues w. item that was
## received..so i would buy from them again
## 
## film => food
## Was a great series. What they have today, doesn't come close to the
## entertainment of yesterday. Buy it for pure western enjoyment.
## 
## book => film
## In Persuading Annie, author Melissa Nathan has paid homage to Jane Austen's
## classic novel Persuasion with not only her title but also her cast of
## characters. Annie, a young, motherless college student from a wealthy family,
## is persuaded to reconsider her plans to elope with her college sweetheart,
## Jake, by her interfering stepmother. Seven years later, Annie is still
## unhappily single, and the family business is rapidly going downhill. Re-enter
## Jake, a business consultant, to not only save the
## 
## film => book
## 1. There is a big difference between the suffering of the warmongers and their
## opponents torn out of their ordinary civilian lives into this nightmare spiral
## NOT of their own making. The suffering that the crew endured is unimaginable
## and absolutely unfair and frankly, if we are about to go to war again or if
## the Whale Warriors are in danger of getting attacked and sunk at sea, we should
## have cleared the oceans of shark life.2. Public opinion absolutely must be
## ignored - the example of McVay'
## 
## book => film
## Though the conceit is hokey, one is aroused by the narration of erotic
## liberation of prim women and their subsequent complicity in the conversion of
## others. The prose style is pseudo-Victorian, crafted fairly well -- it doesn't
## interfere with one's lusty enjoyment of the seduction scenes. Fem-libbers would
## abhor it's thesis that women need a man to unlock their sexuality. .
## 
## food => film
## This is a great substitute for corn starch and is better for you than our
## corn starch made from dented corn. You should see the documentary &#34;King
## Corn&#34;. The corn starch most is made of is from the inedible and deadly
## dented corn.
## 
## book => food
## This item arrived quickly. It arrived in excellent condition. It worked exactly
## like it was supposed to. The item met a special need very well. Thank you
## Amazon for having it.
## 
## film => food
## This is Christmas gift. What to get the guy who has everything Dallas is a
## challenge but I take gift giving seriously. He will love them. Would buy again.
## 
## film => food
## Movie Fan - Fan is almost totally correct. It is not as good as the first one.
## Anthony Edwards is terrible. He should never had tried this roll. He did not
## carry it at all.
## 
## book => film
## Shark brags about how he is not afraid of anything, but there might just be
## one thing he is afraid of!"I'm a shark! Aren't I awesome?When I get a shot, I
## don't even cry.I can watch scary movies without closing my eyes.If there were
## a dinosaur here and he saw me, you know what he would be?Scared!"Clever, funny
## text with great illustrations.

The output is formatted to show just the first 500 characters of the text, but you can increase this as needed. It also shows the correct category, followed by the predicted category.

Part 7b: Max Probability

Another way to explore the model fit is to look at the texts that are given the most extreme probabilities. The code here takes, for each category, the three validation texts predicted with the highest probability.

# predicted probabilities; [,,] drops the lambda dimension, leaving documents by categories
pred_mat <- predict(model, newx = X, type = "response")[,,]
amazon %>%
  mutate(pred = colnames(pred_mat)[apply(pred_mat, 1, which.max)]) %>%
  mutate(prob = apply(pred_mat, 1, max)) %>%
  filter(train_id == "valid") %>%
  group_by(category) %>%
  arrange(desc(prob)) %>%
  slice_head(n = 3) %>%
  mutate(text = stri_sub(text, 1, 500)) %>%
  mutate(
    response = sprintf("%s => %s (%0.5f) \n %s\n", category, pred, prob, text)
  ) %>%
  use_series(response) %>%
  stri_split(fixed = "\n") %>%
  unlist() %>%
  stri_wrap(width = 79) %>%
  cat(sep = "\n")
## book => book (1.00000)
## I recently read Hume's famous bookA Treatise of Human Nature(first published
## in 1738), and I thought I would follow it up by reading Whitehead's famous
## bookProcess and Reality: An Essay in Cosmology(written for the 1927-28 Gifford
## Lectures). I doubt that reading Whitehead's book was the best use of my time (I
## was carried away by my curiosity), so I'll write a little review that may help
## you decide if it's worth your time.Whitehead says in the book's preface: "These
## lectures are based upon a recu
## 
## book => book (1.00000)
## The titles of this Peter Berley book, `Fresh Food Fast' combine at least four
## culinary catch phrases in the space of eleven words. `Fresh', `Seasonal',
## `Fast', and `Vegetarian', supported by `Meals in Under an Hour' promise to hit
## as many cookbook buyers' hot buttons as possible. The style of the book is to
## do Rachael Ray one better in the fast menu planning department by addressing
## a common criticism of her '30 Minute Meals' books. Berley or co-author Melissa
## Clark addresses this by providing a
## 
## book => book (1.00000)
## _Works of Love_ by Kierkegaard is the most uplifting, encouraging, and hope-
## restoring book I have ever read. Kierkegaard's statement that "the greatest
## act of love anyone can ever achieve is to mourn for someone who is dead" is a
## statement I have used to guide myself through innumerable existential crises
## and has given me hope in my darkest hours. The wisdom contained in this book is
## an essential tool in dealing with the premature and untimely death of a loved
## one, and restoring your hope and
## 
## film => film (1.00000)
## Yes, "Voices" is a love story, but it's not a glittery fairy tale in the
## style of current romantic films. Rather it's about the romance one can find in
## reality. It's the story of two ordinary residents of the gritty working class
## city of Hoboken, NJ who lived in two different worlds.Drew Rithman, (Ontkean)
## is a truck driver who dreamed of becoming a rock star. Rosemarie Lemon,
## (Irving) was a teacher from an overprotective upper middle class family. She
## longed to become a professional dancer. Bot
## 
## film => film (1.00000)
## Words can't even describe how much I love this Phantom Of The Opera film! When
## my mother borrowed it from the library back in 2003, I was a bit skeptical to
## watch it because before watching this one, the only other one I saw which was
## in the 90's, was some horror version which I think was Dario Argento's version
## and it was so boring and the main storyline bored me. I hated it so I wasn't
## jumping for joy when I saw that my mom had borrowed this 1990 version. I know
## if I started with the Lon Chane
## 
## film => film (1.00000)
## In 1943 Errol Flynn was in trouble. His best days were behind him - Captain
## Blood (1935), Major Vickers (1936), Robin Hood (1938), the Earl of Essex
## (1939), George Armstrong Custer (1941), and Gentleman Jim Corbett (1942) were
## done and gone. He was being criticized for not being in the war (he became a
## naturalized US citizen in 1942), and he was defending himself against charges
## of statutory rape. He was near the apex and a steep decline was ahead of him.
## Apart from "Objective Burma" (1945), he
## 
## food => food (1.00000)
## GENERAL: No, these will not make you high mentally but yet I get high
## physically since three tablespoon of this beauty, once again only 3 tablespoon
## which is 30 gr, offer me 50% magnesium and 25% zinc which allows me to get to
## really high spots in the mountains with my mountain bike or catch those big
## high waves in the ocean since my muscles do not cramp due to magnesium intake
## of hemp seeds. What an absolute joke, in US they continuously talk about health
## care but yet this one creation of natur
## 
## food => food (1.00000)
## Delicious olives, and lots of them, simply processed, without excess
## favorings. No hot peppers, no excess garlic.The similar "Byzantine Fresh Olive
## Antipasto"Byzantine Fresh Olive Antipasto, 5-Pound Bagis the same product but
## with tasty capers, a few garlic cloves and a few peices of bell peppers (for
## color), perhaps a slightly larger ratio of ripe olives, and perhaps saltier.
## I like the flavor of the "Byzantine Fresh Olive Antipasto" better, but the
## "Country Olive Mix" is even better, with my o
## 
## food => food (1.00000)
## I agree with the reviewer who gave these a stellar rating. We've tried a
## variety of pasta and noodle alternatives (spaghetti squash etc) and the
## results ranged from interesting to...terrible. The spaghetti squash was the
## best alternative but it certainly didn't taste like regular pasta. But regular
## pasta plays havoc with our blood sugar and whole wheat pasta wasn't to our
## liking...so we continued our search.These Tofu Shirataki Fettucini are the
## absolute alternative to pasta which we've found an

Usually these will be correctly labeled, but it is possible that there are some errors still.
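
If you want to check for such errors directly, one option is to filter to the misclassified validation texts before taking the highest probabilities; a sketch reusing the code above:

amazon %>%
  mutate(pred = colnames(pred_mat)[apply(pred_mat, 1, which.max)]) %>%
  mutate(prob = apply(pred_mat, 1, max)) %>%
  filter(train_id == "valid") %>%
  filter(pred != category) %>%          # keep only misclassified texts
  arrange(desc(prob)) %>%
  slice_head(n = 3) %>%                 # three most confident errors
  mutate(text = stri_sub(text, 1, 500)) %>%
  mutate(
    response = sprintf("%s => %s (%0.5f) \n %s\n", category, pred, prob, text)
  ) %>%
  use_series(response) %>%
  stri_split(fixed = "\n") %>%
  unlist() %>%
  stri_wrap(width = 79) %>%
  cat(sep = "\n")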