library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(cleanNLP)
library(glmnet)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

Amazon Product Classification

In these notes we will look at a data set of product reviews from Amazon. Each review comes from one of three categories (Books, Films, and Food); the task we will investigate is classifying items into one of these three categories.

set.seed(1)

amazon <- read_csv("data/amazon_product_class.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
amazon
## # A tibble: 8,823 x 4
##    doc_id   category text                                            train_id
##    <chr>    <chr>    <chr>                                           <chr>   
##  1 doc00001 food     "At some point I would like to try TeffBob's R… valid   
##  2 doc00002 book     "John Crowley wrote  \"Little, Big\", in the w… valid   
##  3 doc00003 book     "The New York Post is often credited with popu… train   
##  4 doc00004 film     "INTO THIN AIR, based on Jon Krakauer's best s… train   
##  5 doc00005 film     "When the Wind Blows was based upon an English… train   
##  6 doc00006 food     "I have sent this basket a number of times to … valid   
##  7 doc00007 book     "If you enjoy history, this book is an selecti… train   
##  8 doc00008 film     "Though it holds up surprisingly well over thi… valid   
##  9 doc00009 book     "Agatha's written is amazing. The whole story … train   
## 10 doc00010 food     "Pecans are my favorite nut, alas, they are fa… valid   
## # … with 8,813 more rows
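
Before fitting any models, it is worth checking that the three categories are roughly balanced. A quick sanity check (our own addition, not part of the original analysis):

# number of reviews in each product category
amazon %>%
  group_by(category) %>%
  summarize(n = n())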

We will use a more sophisticated method for producing term frequency (TF) matrices and see how to adapt our text classification methods when there are more than two categories.

Natural language processing (NLP)

As we did in the previous notes, we will use the cleanNLP package to split the textual data into a format with one row for each token (word or punctuation mark). Last time we used the stringi backend, which is fast but error-prone. This time we will use a library called spacy to extract linguistic features from the text. The spacy backend can be run with the following code:

cnlp_init_spacy("en_core_web_sm")
token <- cnlp_annotate(amazon)$token

However, this requires having the Python library spacy already installed and set up on your machine, and it can take a few minutes to finish parsing the text. As an alternative, we will just load the pre-computed data here (I will provide similarly parsed data for the lab and projects):

token <- read_csv("data/amazon_product_class_token.csv.gz")
token
## # A tibble: 1,548,904 x 10
##    doc_id   sid   tid token token_with_ws lemma upos  xpos  tid_source
##    <chr>  <dbl> <dbl> <chr> <chr>         <chr> <chr> <chr>      <dbl>
##  1 doc00…     1     1 At    At            at    ADP   IN             6
##  2 doc00…     1     2 some  some          some  DET   DT             3
##  3 doc00…     1     3 point point         point NOUN  NN             1
##  4 doc00…     1     4 I     I             -PRO… PRON  PRP            6
##  5 doc00…     1     5 would would         would VERB  MD             6
##  6 doc00…     1     6 like  like          like  VERB  VB             0
##  7 doc00…     1     7 to    to            to    PART  TO             8
##  8 doc00…     1     8 try   try           try   VERB  VB             6
##  9 doc00…     1     9 Teff… TeffBob       Teff… PROPN NNP           15
## 10 doc00…     1    10 's    's            's    PART  POS            9
## # … with 1,548,894 more rows, and 1 more variable: relation <chr>

There is a lot of information that has been automatically added to this table, the collective result of decades of research in computational linguistics and natural language processing. Each row corresponds to a word or a punctuation mark, along with metadata describing the token. Notice that reading down the token column reproduces the original text. The columns available are:

  • doc_id: A key that allows us to group tokens into documents and to link back into the original input table.
  • sid: Numeric identifier of the sentence number.
  • tid: Numeric identifier of the token within a sentence. The first three columns form a primary key for the table.
  • token: A character variable containing the detected token, which is either a word or a punctuation mark.
  • token_with_ws: The token with any trailing whitespace (i.e., spaces and new-line characters) added. This is useful if we want to re-create the original text from the token table, as the sketch following this list shows.
  • lemma: A normalized version of the token. For example, it removes start-of-sentence capitalization, turns all nouns into their singular form, and converts verbs into their infinitive form.
  • upos: The universal part of speech code; these are parts of speech that can be defined in (most) spoken languages and tend to correspond to the parts of speech taught in primary schools, such as “NOUN”, “ADJ” (Adjective), and “ADV” (Adverb). The full set of codes and their meaning can be found here: Universal POS tags.
  • xpos: A fine-grained part of speech code that depends on the specific language (here, English) and models being used. You can find more information here: spaCy POS tags
  • tid_source: The token id of the word in the sentence that this token is grammatically related to. Relations always occur within a sentence, so there is no need for a separate indication of the source sid.
  • relation: The name of the relation implied by the tid_source variable. Allowed relations differ slightly across models and languages, but the core set are relatively stable. The codes in this table are Universal Dependencies.
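
For example, here is a minimal sketch (our own, not from the original notes) of re-creating each document's text by pasting the token_with_ws column back together:

# rebuild each review by concatenating tokens with their whitespace;
# the result should match the text column of the original amazon table
text_rebuilt <- token %>%
  group_by(doc_id) %>%
  summarize(text = paste(token_with_ws, collapse = ""))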

There are many analyses that can be performed on the extracted features in this table. We will look at a few here and expand on them over the next few weeks.

Fitting a model

Before delving into the new variables in the tokens table, let’s start by replicating the analysis we did last time with the spam data, this time for the Amazon product categories. Because this is a larger data set, we will pass some additional parameters to the cnlp_utils_tf function to include only those terms that appear in at least 0.1% of the documents and in no more than 50% of them. We can also set the maximum number of terms that will be included.

X <- token %>%
  cnlp_utils_tf(doc_set = amazon$doc_id,
                min_df = 0.001,
                max_df = 0.5,
                max_features = 10000)

dim(X)
## [1] 8823 6871
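
As a quick sanity check (our own addition), the column sums of X give each retained term's total count in the corpus, so we can inspect the most frequent terms that survived the filtering:

# the ten most common terms kept in the term frequency matrix
head(sort(colSums(X), decreasing = TRUE), n = 10)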

If there are more than max_features terms that fall between the allowed frequencies, only the most frequent of these will be used. As before, we will create the training data and run the glmnet function.

X_train <- X[amazon$train_id == "train", ]
y_train <- amazon$category[amazon$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.2, family = "multinomial", nfolds = 3
)

Like the spam model, our penalized regression does a very good job of classifying products. The training set is nearly 99% accurate and the validation set is 96.5% accurate.

amazon %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.989
## 2 valid         0.965

Looking at a confusion matrix, we see that the few mistakes that do occur happen when books and films are confused with one another. Can you think of why this might happen?

amazon %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  select(category, pred, train_id) %>%
  table()
## , , train_id = train
## 
##         pred
## category book film food
##     book 1745   19   13
##     film   14 1762   11
##     food    0    3 1751
## 
## , , train_id = valid
## 
##         pred
## category book film food
##     book 1114   40   21
##     film   44 1091   13
##     food    1    3 1178
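
To quantify this, we can compute the classification rate within each category on the validation set (a check of our own; as.vector() flattens the one-column matrix that predict() returns):

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  filter(train_id == "valid") %>%
  group_by(category) %>%
  summarize(class_rate = mean(category == pred))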

Notice that each category has its own coefficients. In a multinomial model, glmnet fits one coefficient vector per category and converts the resulting three linear predictors into class probabilities with the softmax function; the predicted class is the one with the highest probability.

temp <- coef(model, s = model$lambda[22])  # list with one coefficient vector per category
beta <- Reduce(cbind, temp)                # bind the three vectors into a single matrix
beta <- beta[apply(beta != 0, 1, any),]    # keep terms that are non-zero in any category
colnames(beta) <- names(temp)              # label columns with the category names
beta
## 12 x 3 sparse Matrix of class "dgCMatrix"
##                     book       film         food
## (Intercept) -0.055797141 0.02218407  0.033613070
## book         0.135159825 .          -0.001795063
## read         0.175900860 .           .          
## story        .           .          -0.001530181
## movie        .           0.11408253  .          
## film         .           0.05474195  .          
## taste        .           .           0.226705692
## watch        .           0.10448992  .          
## write        0.005048023 .           .          
## flavor       .           .           0.157245530
## product      .           .           0.046148874
## dvd          .           0.04153220  .

You should see that the words that come out of the model match our intuition for what words would be associated with each product type.
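
Because the model is multinomial, we can also ask glmnet for class probabilities rather than hard labels. As a short sketch (our own addition): with type = "response", predict() returns a three-dimensional array with one row per document, one column per category, and one slice per value of the penalty.

# each row holds the three softmax probabilities and sums to one
prob <- predict(model, newx = X, type = "response")
head(prob[, , 1])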

POS tags

Now, let’s see what we can do by making use of the part of speech tags in the tokens data. To start, we can get a sense of the most common lemmas associated with each universal part of speech:

token %>%
  group_by(upos, lemma) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 8) %>%
  summarize(sm_paste(lemma))
## # A tibble: 18 x 2
##    upos  lemma_paste                                              
##    <chr> <chr>                                                    
##  1 ADJ   "good; great; other; more; many; first; little; old"     
##  2 ADP   "of; in; for; to; with; on; from; by"                    
##  3 ADV   "so; just; very; when; well; really; also; even"         
##  4 AUX   "be; have; do; get; am; are"                             
##  5 CCONJ "and; but; or; &; so; both; yet; either"                 
##  6 DET   "the; a; -PRON-; this; that; an; all; some"              
##  7 INTJ  "well; yes; like; oh; no; please; wow; anyway"           
##  8 NOUN  "book; movie; film; time; story; character; life; way"   
##  9 NUM   "one; two; three; 2; 3; 5; 1; 4"                         
## 10 PART  "to; not; 's; '; s; -PRON-; \"you; can't"                
## 11 PRON  "-PRON-; who; what; there; something; nothing; anyone; i"
## 12 PROPN "Amazon; John; DVD; New; Mr.; James; God; Hollywood"     
## 13 PUNCT ".; ,; \"; -; (; ); !; ;"                                
## 14 SCONJ "as; that; if; like; than; because; while; since"        
## 15 SPACE "NA"                                                     
## 16 SYM   "/; #; $; -; +; =; :; ........"                          
## 17 VERB  "make; will; can; would; see; read; find; go"            
## 18 X     "etc; >; i.e.; de; .; 1; don't; 2"
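
It can also help to see how much of the text each part of speech accounts for. A quick count (our own addition):

# number of tokens assigned to each universal part of speech
token %>%
  group_by(upos) %>%
  summarize(count = n()) %>%
  arrange(desc(count))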

Then, we can use these codes to filter the data to include only certain parts of speech in our model. For example, we can look at only verbs:

X <- token %>%
  filter(upos == "VERB") %>%
  cnlp_utils_tf(doc_set = amazon$doc_id,
                min_df = 0.001,
                max_df = 0.5,
                max_features = 10000)

dim(X)
## [1] 8823 1215

The model has significantly fewer variables now. We then train the model as before:

X_train <- X[amazon$train_id == "train", ]
y_train <- amazon$category[amazon$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.2, family = "multinomial", nfolds = 3
)

How well does the model do in predicting the category of the product? It’s okay, but not as good as the original model:

amazon %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.860
## 2 valid         0.785

But keep in mind that our goal is not to find the most predictive model. Rather, we want to use the predictive model to understand the data. Filtering on verbs alone does exactly that. We can, for example, see which verbs are associated with each category:

# same coefficient extraction as above, at a sparser (larger) value of lambda
temp <- coef(model, s = model$lambda[14])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 14 x 3 sparse Matrix of class "dgCMatrix"
##                     book         film         food
## (Intercept) -0.052799883  0.006107042  0.046692840
## see          .            0.099763854 -0.047794371
## read         0.361814715 -0.015469013 -0.038575636
## use          .            .            0.015498141
## buy          .            .            0.011046697
## watch        .            0.337398714 -0.004667999
## write        0.213694568  .            .          
## add          .            .            0.029662503
## play         .            0.176552766  .          
## taste        .            .            0.470674670
## eat          .            .            0.152733992
## learn        0.002985313  .            .          
## order        .            .            0.020065803
## release      .            0.007553405  .

As with the first model, most of these verbs should seem reasonable to you given the three categories.
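
As a final check (our own addition), we can pull out a few of the validation reviews that the verb-only model misclassifies; presumably these are reviews whose verbs are uninformative or atypical for their category:

# a handful of misclassified validation documents
amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  filter(train_id == "valid", category != pred) %>%
  select(doc_id, category, pred) %>%
  slice_head(n = 5)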