Class 14: Hail to the Chief

2017-10-12

library(readr)
library(dplyr)
library(ggplot2)
library(methods)
library(stringi)
library(smodels)
library(tokenizers)

Mid-course reviews

Thank you all for the really great feedback on the course. It seems that things are generally going well. A few themes I noticed:

  • help with how to select tuning parameters and more on what they are doing
  • more references to other material
  • posting more student solutions
  • perhaps some more interactive class time
  • labs first works well
  • enjoying the datasets
  • pace seems about right

For the first note, I’ll reiterate that I try to explain roughly what each parameter is directly doing to the model. Why we might change it and what the best value should be is a harder question. Generally I use the validation set to tell me good values. Hopefully, from seeing lots of problems over time, you will gain some intuition about which values tend to work well.
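As a concrete sketch of that workflow, here is the kind of loop I have in mind, using the elastic net’s alpha parameter as the example. It assumes a training/validation split of the form we build later in these notes (X_train, y_train, X_valid, y_valid), so treat it as a template rather than something to run as-is at this point.

library(glmnet)
# try several candidate values of the tuning parameter and keep the one
# with the best validation performance
alpha_vals <- c(0, 0.2, 0.5, 0.8, 1)
valid_acc <- rep(NA_real_, length(alpha_vals))
for (i in seq_along(alpha_vals)) {
  fit <- cv.glmnet(X_train, y_train, family = "multinomial",
                   nfolds = 3, alpha = alpha_vals[i])
  pred <- predict(fit, newx = X_valid, type = "class")
  valid_acc[i] <- mean(pred == y_valid)
}
data.frame(alpha = alpha_vals, valid_acc = valid_acc)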

For the second, I should do a better job with this. Here are a few references that are really great for understanding the main packages we have been using:

If you want more general information on R, here’s a great reference:

Now, if you would like more theoretical notes about all of the topics in this class, I would recommend two books. Both of these have free pdf versions available online:

Both are large pdfs. The second one is much more theoretical than the first.

Once we get to neural networks for text and image processing, I will point you all to many more papers and references, as these two fields have changed rapidly over the past five years and it is hard for textbooks to keep up.

If there is anything I can do to help you learn the material for this class, please let me know!

Amazon Classification – Take Two

Let’s look at the Amazon classification data again:

amazon <- read_csv("https://statsmaths.github.io/ml_data/amazon_product_class.csv")

For the lab, you all should have created features that you thought might be useful for the classification task. Can we do this in a more systematic way? Yes!

The basic approach is to create features for all (or the most common) words that occur in a collection of texts and then use a technique such as the elastic net that can be run with a large number of variables.

The first step is to use the tokenize_words function from the tokenizers package to break each text into words:

token_list <- tokenize_words(amazon$text)
token_list[10]
## [[1]]
##   [1] "pecans"      "are"         "my"          "favorite"    "nut"        
##   [6] "alas"        "they"        "are"         "fairly"      "expensive"  
##  [11] "i"           "have"        "purchased"   "pecans"      "from"       
##  [16] "many"        "sources"     "and"         "the"         "real"       
##  [21] "trick"       "is"          "to"          "get"         "fresh"      
##  [26] "nuts"        "not"         "ones"        "that"        "have"       
##  [31] "sat"         "around"      "for"         "a"           "season"     
##  [36] "these"       "pecans"      "arrived"     "quickly"     "were"       
##  [41] "packaged"    "well"        "came"        "in"          "2"          
##  [46] "heavy"       "non"         "resealable"  "bags"        "they"       
##  [51] "are"         "as"          "good"        "as"          "the"        
##  [56] "fancy"       "pecans"      "i"           "have"        "paid"       
##  [61] "a"           "lot"         "more"        "for"         "direct"     
##  [66] "from"        "the"         "farm"        "they"        "are"        
##  [71] "very"        "clean"       "with"        "very"        "little"     
##  [76] "bitter"      "34"          "trash"       "34"          "as"         
##  [81] "is"          "found"       "in"          "many"        "supermarket"
##  [86] "pecans"      "the"         "nuts"        "are"         "moist"      
##  [91] "and"         "sweet"       "but"         "nuts"        "are"        
##  [96] "seasonal"    "and"         "the"         "longer"      "they"       
## [101] "sit"         "after"       "harvest"     "the"         "drier"      
## [106] "they"        "get"         "so"          "i"           "assume"     
## [111] "ordering"    "these"       "late"        "in"          "the"        
## [116] "season"      "would"       "result"      "in"          "drier"      
## [121] "smaller"     "nuts"        "i"           "will"        "purchase"   
## [126] "them"        "again"       "the"         "price"       "is"         
## [131] "good"        "and"         "i"           "can't"       "grow"       
## [136] "them"        "where"       "i"           "live"

We then use the term_list_to_df function from smodels to convert this into a feature data frame with one token per row:

token_df <- term_list_to_df(token_list)
filter(token_df, id == 10)
## # A tibble: 139 x 2
##       id     token
##    <int>     <chr>
##  1    10    pecans
##  2    10       are
##  3    10        my
##  4    10  favorite
##  5    10       nut
##  6    10      alas
##  7    10      they
##  8    10       are
##  9    10    fairly
## 10    10 expensive
## # ... with 129 more rows

And, finally, we can get a data matrix giving counts for each of these words with the smodels function term_df_to_matrix. In theory, the result has one row per text and one column for each unique word. However, by default, only the 10000 most frequent terms are used (the rare terms that get dropped are usually proper nouns or misspellings that occur in only one or two texts anyway):

X <- term_df_to_matrix(token_df)
dim(X)
## [1]  9000 10000

Here are the first few rows and columns. The columns are ordered by frequency.

colnames(X)[c(1:10, 300, 1000, 5000)]
##  [1] "the"       "and"       "a"         "of"        "to"       
##  [6] "is"        "i"         "in"        "it"        "this"     
## [11] "free"      "share"     "gentleman"
X[1:10,c(1:10, 300, 1000, 5000)]
## 10 x 13 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 13 column names 'the', 'and', 'a' ... ]]
##                                    
##  [1,]  .  .  1  4 2 1 5 . 3 . . . .
##  [2,] 13  9 11 13 7 7 3 4 3 2 . . .
##  [3,]  4  2  6  3 6 1 . 1 1 1 . . .
##  [4,] 12  5  2  6 6 5 1 6 1 2 . . .
##  [5,]  6  2  5  4 3 . 2 1 . 2 . . .
##  [6,]  .  2  1  1 1 . 2 . . 1 . . .
##  [7,]  2  1  3  2 3 1 3 . 2 1 . . .
##  [8,] 24 12 12  9 8 4 . 6 3 . . . .
##  [9,]  3  3  1  2 1 2 1 2 2 2 . . .
## [10,]  8  4  2  . 1 3 6 4 . . . . .

So, for example, the 10th text used the word “the” 8 times. Notice that this function creates a sparse matrix, in which the zero entries are not explicitly stored (they print as dots). This is very useful here because over 99% of the entries in the matrix X are zero:

mean(X == 0)
## [1] 0.9917839
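To get a rough sense of the savings, we can compare the memory actually used by the sparse matrix to what a dense matrix of the same shape would need (a quick check; the exact numbers will vary):

# memory used by the sparse representation
format(object.size(X), units = "Mb")
# a dense numeric matrix of the same shape would need roughly
# nrow(X) * ncol(X) * 8 bytes, shown here in megabytes
prod(dim(X)) * 8 / 1024^2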

Using this matrix as a design matrix, we can then create our usual training and validation sets:

y <- amazon$category
X_train <- X[amazon$train_id == "train",]
X_valid <- X[amazon$train_id == "valid",]
y_train <- y[amazon$train_id == "train"]
y_valid <- y[amazon$train_id == "valid"]

As mentioned above, the elastic net is a fantastic model for this problem.

library(glmnet)
model <- cv.glmnet(X_train, y_train, nfolds = 5, family = "multinomial")
beta <- coef(model, s = model$lambda.min)
beta <- Reduce(cbind, beta)
beta[apply(beta != 0, 1, any),]
## 85 x 3 sparse Matrix of class "dgCMatrix"
##                       1            1             1
## (Intercept) -0.20803523  0.276301426 -0.0682661948
## the          .          -0.015061070  .           
## of           .          -0.064302976  .           
## one          .           .            0.0002073049
## book         1.44961914  .            .           
## my           .           0.060632161  .           
## an           .          -0.041219116  .           
## who          .          -0.068278898  .           
## good        -0.02503108  .            .           
## movie        .           .            1.4156820628
## film         .           .            0.8385459558
## story        .          -0.424950615  .           
## these        .           0.445995034  .           
## read         1.04503252  .            .           
## see          .           .            0.0066460869
## characters   0.17577005  .            .           
## taste        .           1.439752169  .           
## series       .          -0.016616493  .           
## books        0.71690838  .            .           
## flavor       .           1.349980299  .           
## use          .           0.443949844  .           
## dvd          .           .            1.0812829166
## reading      0.51808774  .            .           
## coffee       .           0.185412827  .           
## author       0.40764428  .            .           
## buy          .           0.098545049  .           
## product      .           0.799492674  .           
## movies       .           .            0.8816641498
## fun          .          -0.086597555  .           
## novel        0.19159722  .            .           
## watch        .           .            0.6475057245
## written      0.13068056  .            .           
## try          .           0.112499915  .           
## price        .           0.512040200  .           
## show         .           .            0.0330752800
## writing      0.26493186  .            .           
## stories      0.09670874  .            .           
## sweet        .           0.046613059  .           
## chocolate    .           0.001380139  .           
## add          .           0.246338098  .           
## amazon       .           0.215266589  .           
## sugar        .           0.098411498  .           
## tried        .           0.352348556  .           
## acting       .           .            0.4270787818
## classic      .           .            0.5861833935
## watching     .           .            0.4124209350
## cup          .           0.157527795  .           
## box          .           0.133565716  .           
## eat          .           0.457146599  .           
## cast         .           .            0.2128279713
## reader       0.16885496  .            .           
## fresh        .           0.406011129  .           
## video        .           .            0.7396884301
## mix          .           0.022023183  .           
## bag          .           0.378283181  .           
## tastes       .           0.928814542  .           
## delicious    .           0.674937629  .           
## store        .           0.242036573  .           
## comedy       .           .            0.0712753949
## romance      0.17167010  .            .           
## milk         .           0.170693984  .           
## actors       .           .            0.4974024388
## drink        .           0.056857168  .           
## brand        .           0.134949746  .           
## healthy      .           0.375010237  .           
## snack        .           0.332254425  .           
## episode      .           .            0.3655886305
## tasty        .           0.775434905  .           
## flavors      .           0.376780724  .           
## watched      .           .            0.0649728687
## texture      .           0.100550956  .           
## expensive    .           0.202212457  .           
## tasting      .           0.621205123  .           
## tape         .           .            0.1527395777
## episodes     .           .            0.1505302679
## flick        .           .            0.1881490256
## flavored     .           0.216394219  .           
## workout      .           .            0.1849329789
## subscribe    .           0.103672374  .           
## ordering     .           0.056434886  .           
## salty        .           0.046164236  .           
## flavorful    .           0.231721673  .           
## bulk         .           0.157663761  .           
## bottles      .           0.034093371  .           
## klausner     1.40389380  .            .

Neat, right? What are some interesting patterns you see in the output?
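One way to start exploring is to sort the non-zero coefficients for a single class; here is a small sketch for the first column (the other two work the same way):

# non-zero coefficients for the first class, intercept dropped,
# sorted from largest to smallest
b1 <- beta[, 1]
b1 <- b1[names(b1) != "(Intercept)"]
sort(b1[b1 != 0], decreasing = TRUE)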

Let’s see how this model stacks up against your predictions from today.

amazon_full <- read_csv("~/gh/ml_data_full/amazon_product_class.csv")

amazon_full$category_pred <- predict(model, newx = X, type = "class")
tapply(amazon_full$category_pred == amazon_full$category,
       amazon_full$train_id, mean)
##      test     train     valid 
## 0.9188889 0.9270370 0.9233333

Since you all wait until the last possible minute to hand in the assignment, I don’t know the answer yet, but hopefully it compares fairly well.

Authorship Detection

Our text prediction task for today involves something called authorship detection. Given a short snippet of text, predict who wrote or said it. The data for today comes from State of the Union Addresses. Each speech was broken up into small snippets; we want to detect which president is associated with each bit of text.

president <- read_csv("https://statsmaths.github.io/ml_data/presidents_3.csv")

Here, I gave a class name in addition to a numeric value to make it easy to look at the results. The data comes from our past three presidents:

table(president$class_name)
## 
##       Barack Obama     George W. Bush William J. Clinton 
##                317                319                329

For example, here is a bit of text from George W. Bush:

{
  cat(president$class_name[1000])
  cat("\n-----------------------------------\n")
  cat(stri_wrap(president$text[1000], width = 60), sep = "\n")
}
## George W. Bush
## -----------------------------------
## Iraqis are showing their courage every day, and we are proud
## to be their allies in the cause of freedom. Our work in Iraq
## is difficult because our enemy is brutal. But that brutality
## has not stopped the dramatic progress of a new democracy.
## In less than 3 years, the nation has gone from dictatorship
## to liberation, to sovereignty, to a Constitution, to
## national elections. At the same time, our coalition has been
## relentless in shutting off terrorist infiltration, clearing
## out insurgent strongholds, and turning over territory to
## Iraqi security forces.

For fairness, the train/valid/test split was done by year. That is, all of the snippets from a given year’s speech were assigned to the same one of the three groups. This prevents us from learning features that would not be useful outside of the corpus (such as the names of particular people or specific issues that were only relevant at one moment in time).
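If you want to check this yourself, and assuming the dataset includes a year variable (I am calling it year here as an assumption; verify the actual name with names(president)), a cross-tabulation should show every year landing in exactly one split:

# hypothetical check: each year (row) should have non-zero counts in
# only one of the three columns ('year' is an assumed column name)
table(president$year, president$train_id)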

Let’s build another feature matrix using the words in this corpus. I’ll put all the steps in one block to make it easier for you to copy and adapt in your own work.

token_list <- tokenize_words(president$text)
token_df <- term_list_to_df(token_list)
X <- term_df_to_matrix(token_df, min_df = 0.01, max_df = 0.90,
                       scale = TRUE)

y <- president$class
X_train <- X[president$train_id == "train",]
X_valid <- X[president$train_id == "valid",]
y_train <- y[president$train_id == "train"]
y_valid <- y[president$train_id == "valid"]
dim(X)
## [1] 1200 1027

I have added the options min_df and max_df to filter to only those terms used in at least 1% of the documents but in no more than 90% of them. These filters help make the computations much faster. I also scaled the rows of the data; this turns raw occurrence counts into frequencies. Generally, I play around with whether or not that improves my model.
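As a quick sketch of how to check whether scaling helps, the same pipeline can be rerun with scale = FALSE (which I am assuming returns the raw counts) and the validation accuracies compared:

library(glmnet)
# rebuild the matrix without row scaling and refit the same model
X_raw <- term_df_to_matrix(token_df, min_df = 0.01, max_df = 0.90,
                           scale = FALSE)
model_raw <- cv.glmnet(X_raw[president$train_id == "train",], y_train,
                       nfolds = 5, family = "multinomial")
pred_raw <- predict(model_raw,
                    newx = X_raw[president$train_id == "valid",],
                    type = "class")
mean(pred_raw == y_valid)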

We’ll use the elastic net again:

library(glmnet)
model <- cv.glmnet(X_train, y_train, nfolds = 5, family = "multinomial")

I toyed with the choice of lambda to find a value that shows which features the model is picking up most strongly.
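Every cv.glmnet object stores the full sequence of lambda values it tried, along with the cross-validated error of each, so you can inspect the candidates directly (plot(model) gives a quick visual summary of the same information):

# the sequence of penalty values and the cross-validated error of each
head(model$lambda)
head(model$cvm)

Here I simply grabbed the 15th value along that path: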

beta <- coef(model, s = model$lambda[15])
beta <- Reduce(cbind, beta)
beta[apply(beta != 0, 1, any),]
## 54 x 3 sparse Matrix of class "dgCMatrix"
##                        1             1            1
## (Intercept)  -0.08839805  4.713983e-02  0.041258223
## that          8.56882952  .             .          
## is            .           7.992963e+00  .          
## i             .           .             1.252293202
## all           .           .             2.182879438
## people        .           .             0.440000480
## but           .          -1.479305e+01  .          
## must        -13.20984353  .             .          
## do            .          -6.229348e+00  .          
## now           6.54584812  .             .          
## should        .           .             2.895454083
## jobs         14.98101138 -4.979440e+00  .          
## security      .           1.066791e+01  .          
## that's       20.36644887  .             .          
## great         .           1.306707e+01  .          
## say           .           .            19.144161510
## why           1.30683174  .             .          
## businesses    0.12665860  .             .          
## cut           .           .             8.223972086
## where         2.81511181  .             .          
## national      .           .             4.439548768
## don't         .          -2.609729e+00  .          
## welfare       .           .            15.926791480
## social        .           7.222435e+00  .          
## i'm           5.07671650  .             .          
## iraq          .           2.462884e+01  .          
## terrorists    .           1.282753e+01  .          
## yet           .           2.240375e+01  .          
## men           .           .            -0.009470675
## we'll         2.14073953  .             .          
## very          .           .            32.525272889
## program       .           .             0.989088279
## think         .           .             8.433953621
## terror        .           3.027218e+01  .          
## going         .           .             7.848648437
## interest      .           .            10.195887893
## relief        .           1.245051e+01  .          
## democrats     3.83367210  .             .          
## crisis        1.92848359  .             .          
## funding       .           8.603239e-02  .          
## ought         .           .            54.168384942
## cold          .           .            30.204233838
## personal      .           1.387599e+00  .          
## saddam        .           1.968808e+01  .          
## liberty       .           9.502595e+00  .          
## hussein       .           7.904885e-06  .          
## really        .           .             2.882634304
## defend        .           5.196453e-01  .          
## race         13.93866591  .             .          
## compassion    .           7.416588e+00  .          
## duty          .           1.062364e+01  .          
## solar         0.35937764  .             .          
## sound         .           2.531681e+00  .          
## current       .           1.740573e+01  .

At least some of these should seem unsurprising.

president$class_pred <- predict(model, newx = X, type = "class")
tapply(president$class_pred == president$class,
       president$train_id, mean)
##      test     train     valid 
##        NA 0.9262673 0.7388535

The classification rate is not bad either, though there is clearly some overfitting. This is likely because the cross-validation folds are chosen at random, while our train/validation split is not random (it was done by year).
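One idea for getting a more honest internal estimate, offered only as a sketch: cv.glmnet accepts a foldid argument, so if a year variable is available (again, the column name year is an assumption on my part), the cross-validation folds can be grouped by year to mimic the validation split:

# group the cross-validation folds by year so the internal folds match
# the year-based train/validation split ('year' is an assumed column)
foldid <- as.integer(factor(president$year[president$train_id == "train"]))
model_year <- cv.glmnet(X_train, y_train, foldid = foldid,
                        family = "multinomial")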

Other Tokens

One major change we can make to the above modelling approach is to split the text into something other than words. One common alternative is to tokenize into ngrams; these are sequences of n consecutive words rather than individual words. A sequence of two words is called a bigram, three a trigram, and so forth. We can access these using the tokenize_ngrams function. By setting n_min equal to 1, I make sure to also get the single words (or unigrams).
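To see what this returns, here is a tiny example on a made-up sentence (not from the corpus):

tokenize_ngrams("the state of the union is strong", n_min = 1, n = 2)

Each element of the output mixes the individual words with every pair of adjacent words. Here is the full pipeline again, now using unigrams and bigrams from the speeches: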

token_list <- tokenize_ngrams(president$text, n_min = 1, n = 2)
token_df <- term_list_to_df(token_list)
X <- term_df_to_matrix(token_df, min_df = 0.01, max_df = 0.90,
                       scale = TRUE)

y <- president$class
X_train <- X[president$train_id == "train",]
X_valid <- X[president$train_id == "valid",]
y_train <- y[president$train_id == "train"]
y_valid <- y[president$train_id == "valid"]
dim(X)
## [1] 1200 1852

Because of the aggressive filtering rules, the resulting matrix is not too much larger than the original word-level one (1852 columns versus 1027). Here are some of the bigrams that were included:

set.seed(1)
sample(colnames(X), 25)
##  [1] "half"        "leave"       "enemy"       "each of"     "since"      
##  [6] "advanced"    "stopped"     "character"   "build a"     "energy"     
## [11] "of america"  "know that"   "say that"    "common"      "reason"     
## [16] "willing"     "make our"    "senior"      "work and"    "takes"      
## [21] "new markets" "public"      "race"        "continue"    "but i"

We can use this new data matrix exactly as before by passing it to the cv.glmnet function.

library(glmnet)
model <- cv.glmnet(X_train, y_train, nfolds = 5, family = "multinomial")

Here are the terms picked out by the elastic net model:

beta <- coef(model, s = model$lambda[15])
beta <- Reduce(cbind, beta)
beta[apply(beta != 0, 1, any),]
## 68 x 3 sparse Matrix of class "dgCMatrix"
##                             1             1             1
## (Intercept)     -8.043148e-02  2.382003e-02  5.661145e-02
## that             1.147351e+01  .             .           
## is               .             1.096880e+01  .           
## i                .             .             6.817890e-01
## but              6.471316e-01 -1.766330e+01  .           
## must            -2.046867e+00  .             .           
## do               .            -6.739337e+00  .           
## now              6.610564e+00  .             .           
## we must         -2.354904e+01  .             .           
## jobs             1.961361e+01 -4.916850e+00  .           
## security         .             1.856780e+01  .           
## that's           2.643896e+01  .             .           
## we should        .             .             2.489527e+00
## great            .             1.839072e+01  .           
## say              .             .             2.095896e+01
## businesses       4.294011e-01  .             .           
## cut              .             .             1.116984e+01
## where            1.620573e+00  .             .           
## our country     -3.547678e+00  1.119496e+01  .           
## all of           .             .             1.141197e+01
## the united       .             2.260267e+00  .           
## don't            .            -2.091926e+00  .           
## of a             4.259997e+00  .             .           
## welfare          .             .             2.117819e+01
## iraq             .             3.065226e+01  .           
## terrorists       .             1.409994e+01  .           
## yet              .             2.730007e+01  .           
## men              .             .            -1.203862e+00
## social security  .             3.761022e+00  .           
## we'll            3.676765e+00  .             .           
## so that          .             .             3.042030e+01
## very             .             .             3.782851e+01
## think            .             .             8.859004e+00
## terror           .             4.034712e+01  .           
## interest         .             .             9.591404e+00
## relief           .             1.516102e+01  .           
## the new          .             .             3.522391e+01
## crisis           1.020524e+00  .             .           
## funding          .             2.898070e+00  .           
## going to         .             .             2.689956e+01
## people to        .             .             6.151514e+00
## innovation       1.090794e+00  .             .           
## that has         1.606313e+01  .             .           
## are not          .             1.596274e+01  .           
## ought            .             .             6.802820e+01
## ought to         .             .             1.716341e-14
## fellow citizens  .             1.943765e+01  .           
## fact             1.328106e+00  .             .           
## my fellow        .             .             1.577682e+01
## necessary        2.098427e-01  .             .           
## the terrorists   .             4.950841e+00  .           
## at a             .             2.595681e+00  .           
## personal         .             1.738921e+00  .           
## saddam           .             2.419539e+01  .           
## liberty          .             9.154957e+00  .           
## invest in        1.327780e+01  .             .           
## hussein          .             1.484739e-01  .           
## saddam hussein   .             1.397277e-13  .           
## and other        .             5.878574e+00  .           
## race             1.729744e+01  .             .           
## compassion       .             8.853328e+00  .           
## democrats and    2.751797e+01  .             .           
## duty             .             9.536758e+00  .           
## and republicans  1.195750e-13  .             .           
## our deficit      4.967789e+01  .             .           
## the cold         .             .             7.783723e+01
## sound            .             4.652890e+00  .           
## current          .             2.032678e+01  .

There are certainly some interesting bigrams here, such as “saddam hussein” and “social security”. The first is mostly useful in describing to us what the model has found; the second probably helps to distinguish the separate meanings of “social”, “security”, and “social security”.

president$class_pred <- predict(model, newx = X, type = "class")
tapply(president$class_pred == president$class,
       president$train_id, mean)
##      test     train     valid 
##        NA 0.9646697 0.7452229

While the training set fits better, we have mostly just overfit the original data. In many cases, however, bigrams and trigrams are quite powerful. We’ll see over the next few classes exactly when and where each of these is most useful.

Character grams

Another way to split apart the text is to break it into individual characters. This would have helped to find the exclamation mark and pound symbol in the spam example, for instance. Consecutive characters can also be put together in much the same way as bigrams and trigrams; these are often called character shingles. We can get them in R by using the tokenize_character_shingles function. Here, we’ll get all shingles from 1 to 3 characters wide. I like to include the non-alphanumeric characters (i.e., those that are not letters or numbers), but feel free to experiment with excluding them.
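As a tiny example of what these shingles look like, here is the tokenizer applied to a short made-up string (not from the corpus):

tokenize_character_shingles("Tax cuts!", n_min = 1, n = 3,
                            strip_non_alphanum = FALSE)

And here is the full pipeline on the speeches: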

token_list <- tokenize_character_shingles(president$text, n_min = 1, n = 3,
                                          strip_non_alphanum = FALSE)
token_df <- term_list_to_df(token_list)
X <- term_df_to_matrix(token_df, min_df = 0.01, max_df = 0.90,
                       scale = TRUE)

y <- president$class
X_train <- X[president$train_id == "train",]
X_valid <- X[president$train_id == "valid",]
y_train <- y[president$train_id == "train"]
y_valid <- y[president$train_id == "valid"]
dim(X)
## [1] 1200 3482

The number of columns is almost twice as large as in the bigram model. We can plug this matrix into the elastic net once more, this time setting alpha to 0.8:

library(glmnet)
model <- cv.glmnet(X_train, y_train, nfolds = 5, family = "multinomial",
                   alpha = 0.8)

The coefficients are not quite as interesting here, however, as it is hard to figure out exactly what each feature is picking up:

beta <- coef(model, s = model$lambda[15])
beta <- Reduce(cbind, beta)
beta[apply(beta != 0, 1, any),]
## 43 x 3 sparse Matrix of class "dgCMatrix"
##                       1           1            1
## (Intercept)  -0.2563556   0.2483961  0.007959487
## ne            .           .         -2.555076638
## tha          26.3806689   .          .          
## hat          38.2825082 -54.5833445  .          
## st            .           .         23.341560233
## '            76.4900336   .          .          
## t t           .           .         21.882873554
## w             .          -2.2304785  .          
## bu            .         -32.7861732  .          
## al          -14.2996064   .          .          
## ut            .         -50.6903679  .          
##  se           .           8.5197782  .          
## s w          16.4298583   .          .          
##  i            .           .         13.535832897
## t w           .          -7.7865371  .          
## mu           -5.5779028   .          .          
## ple           .           .         50.522292029
## ob            .         -42.8669618  .          
## ted           .          18.8776764  .          
##  mu         -46.5316779   .          .          
## t'            0.9864421   .          .          
## t's           1.3328920   .          .          
## job          79.3538678   .          .          
## , b           5.6066240   .          .          
## bs           13.7581395   .          .          
## las           .         -29.7132786  .          
## k t           .           .          7.661043224
## err           .         203.4152889  .          
## gen           .           .         -0.078094609
## ira           .         158.9916717  .          
## wel           .           .          5.979716814
## add           .         221.1512340  .          
## oci           .         112.6340362  .          
## soc           .          32.6418954  .          
## ink           .           .         73.354222168
## arm           .          12.7024326  .          
## lfa           .           .         75.367948337
## . y           .          81.6842917  .          
## yet           .          59.2138603  .          
## --          279.6770096   .          .          
## ogi           .         120.1951763  .          
## auc           .           .         75.436026292
## ucr           .           .         52.980501631
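
One rough way to guess what a strong shingle such as “ira” is picking up is to look at which words in the corpus contain it (a quick sketch using stringi and the word tokenizer):

# words in the speeches containing the shingle "ira", by frequency
words <- unlist(tokenize_words(president$text))
head(sort(table(words[stri_detect_fixed(words, "ira")]), decreasing = TRUE))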

Predicting on the data, we see that the model performs quite well even though we might not fully understand what it is doing. It is not quite as good as the bigram model, however.

president$class_pred <- predict(model, newx = X, type = "class")
tapply(president$class_pred == president$class,
       president$train_id, mean)
##      test     train     valid 
##        NA 0.9431644 0.7261146

In the lab for next class you’ll be given a version of this dataset that contains 5 (different) presidents. Try experimenting with these automatic functions for extracting features and creating data matrices from text.