library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(glmnet)
library(Matrix)
library(magrittr)
library(stringi)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(sparse.colnames = TRUE)
options(width = 77L)

Spam Data

Today, we are going to look at our first two data sets containing textual data. For both, the classification task will be to distinguish “spam” from “ham”. In the notes we will look at an older data set of text messages from the UK. The lab looks at a collection of email messages. Please note that both contain a relatively large amount of inappropriate language, though nothing beyond what you would expect to find in unsolicited spam messages.

Data set

To start, we will read the spam data into R and create a training and validation split of the data.

set.seed(1)

spam <- read_csv("data/spam.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
spam
## # A tibble: 1,276 x 4
##    doc_id   class text                                               train_id
##    <chr>    <dbl> <chr>                                              <chr>   
##  1 doc00001     0 Ok not a problem will get them a taxi. C ing  tom… valid   
##  2 doc00002     1 Free Top ringtone -sub to weekly ringtone-get 1st… valid   
##  3 doc00003     1 PRIVATE! Your 2003 Account Statement for 07753741… train   
##  4 doc00004     0 But you dint in touch with me.                     train   
##  5 doc00005     1 FREE>Ringtone! Reply REAL or POLY eg REAL1 1. Pus… train   
##  6 doc00006     0 Come to me, slave. Your doing it again ... Going … valid   
##  7 doc00007     1 YOU ARE CHOSEN TO RECEIVE A £350 AWARD! Pls call … train   
##  8 doc00008     0 ER, ENJOYIN INDIANS AT THE MO..yeP. SaLL gOoD Heh… valid   
##  9 doc00009     1 Our records indicate u maybe entitled to 5000 pou… train   
## 10 doc00010     1 Last Chance! Claim ur £150 worth of discount vouc… valid   
## # … with 1,266 more rows

All of the interesting features that we can use to detect spam are contained in the variable text, which consists of the message itself. Let’s look at a few “ham” messages; I will use the use_series function (a pipe-friendly version of the $ operator from magrittr) in order to display the full text in the notebook:

spam %>%
  filter(class == 0) %>%
  sample_n(10) %>%
  use_series(text)
##  [1] "Okay... We wait ah"                                                                                                                                                             
##  [2] "Its a part of checking IQ"                                                                                                                                                      
##  [3] "Goodmorning, today i am late for 2hrs. Because of back pain."                                                                                                                   
##  [4] "Babe, I need your advice"                                                                                                                                                       
##  [5] "Home so we can always chat"                                                                                                                                                     
##  [6] "Nt only for driving even for many reasons she is called BBD..thts it chikku, then hw abt dvg cold..heard tht vinobanagar violence hw is the condition..and hw ru ? Any problem?"
##  [7] "Takin a shower now but yeah I'll leave when I'm done"                                                                                                                           
##  [8] "Garbage bags, eggs, jam, bread, hannaford wheat chex"                                                                                                                           
##  [9] "Great. I was getting worried about you. Just know that a wonderful and caring person like you will have only the best in life. Know that u r wonderful and God's love is yours."
## [10] "I don't want you to leave. But i'm barely doing what i can to stay sane. fighting with you constantly isn't helping."

And similarly, here is a random sample of spam messages:

spam %>%
  filter(class == 1) %>%
  sample_n(10) %>%
  use_series(text)
##  [1] "Welcome to Select, an O2 service with added benefits. You can now call our specially trained advisors FREE from your mobile by dialling 402."                     
##  [2] "Dorothy@kiefer.com (Bank of Granite issues Strong-Buy) EXPLOSIVE PICK FOR OUR MEMBERS *****UP OVER 300% *********** Nasdaq Symbol CDGT That is a $5.00 per.."     
##  [3] "Ur cash-balance is currently 500 pounds - to maximize ur cash-in now send CASH to 86688 only 150p/msg. CC: 08718720201 PO BOX 114/14 TCR/W1"                      
##  [4] "Freemsg: 1-month unlimited free calls! Activate SmartCall Txt: CALL to No: 68866. Subscriptn3gbp/wk unlimited calls Help: 08448714184 Stop?txt stop landlineonly" 
##  [5] "Congrats! 2 mobile 3G Videophones R yours. call 09061744553 now! videochat wid ur mates, play java games, Dload polyH music, noline rentl. bx420. ip4. 5we. 150pm"
##  [6] "Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net"        
##  [7] "URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid12hrs only"     
##  [8] "Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy?"                               
##  [9] "Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+"         
## [10] "Do you want a new video handset? 750 anytime any network mins? Half Price Line Rental? Camcorder? Reply or call 08000930705 for delivery tomorrow"

Would you be able to classify these messages manually? Without worrying about the specifics, how would you do that?

Hand-constructing features

In order to use any of the linear models that we have seen so far, we need to create numeric predictor variables (“features”) from the data set. Often this has just been a matter of choosing which existing numeric variables to include in our model. Sometimes we needed to do a bit more work, such as using a polynomial expansion or indicator variables. When working with text, creating the feature variables is much more of an art form. In fact, it is what we will spend much of the rest of the semester focused on.

To start, let’s create variables that count the occurrence of some characters that might be associated with spam. We will use the function stri_count to count the number of times that the exclamation mark, the question mark, and the pound (currency) symbol occur. We can also use stri_length to compute the total length of the message. Here is what the features look like in the data:

spam %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_gbp = stri_count(text, fixed = "£")
  ) %>%
  select(class, length, num_exclam, num_quest, num_gbp)
## # A tibble: 1,276 x 5
##    class length num_exclam num_quest num_gbp
##    <dbl>  <int>      <int>     <int>   <int>
##  1     0    114          0         0       0
##  2     1    114          0         1       0
##  3     1    148          1         0       0
##  4     0     30          0         0       0
##  5     1    157          1         0       1
##  6     0    128          0         0       0
##  7     1    158          1         0       1
##  8     0     80          0         1       0
##  9     1    156          0         0       0
## 10     1    152          4         0       2
## # … with 1,266 more rows

Now that we have these features, let’s use them to build a logistic regression predicting whether a message is spam.

model <- spam %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_gbp = stri_count(text, fixed = "£")
  ) %>%
  filter(train_id == "train") %>%
  glm(
    class ~ length + num_exclam + num_quest + num_gbp,
    data = .,
    family = binomial
  )

summary(model)
## 
## Call:
## glm(formula = class ~ length + num_exclam + num_quest + num_gbp, 
##     family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.7294  -0.5245   0.9197   1.9605  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.009979   0.206096  -9.753  < 2e-16 ***
## length       0.013693   0.001822   7.515 5.69e-14 ***
## num_exclam   0.589993   0.141950   4.156 3.23e-05 ***
## num_quest   -0.129216   0.157061  -0.823    0.411    
## num_gbp      2.408347   0.396517   6.074 1.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1077.09  on 776  degrees of freedom
## Residual deviance:  780.36  on 772  degrees of freedom
## AIC: 790.36
## 
## Number of Fisher Scoring iterations: 7
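
Logistic regression coefficients are on the log-odds scale; exponentiating them turns each one into a multiplicative effect on the odds of spam. As a quick sketch (output not shown):

# Odds ratios: for example, each additional pound sign multiplies the
# estimated odds of being spam by exp(2.408), roughly a factor of 11.
exp(coef(model))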

We see that exclamation marks and pound signs are both associated with spam. Longer messages also tend to be spam. Question marks are negatively associated with spam, though the coefficient is not significant. How well does the model do at predicting spam?

spam %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_gbp = stri_count(text, fixed = "£")
  ) %>%
  mutate(pred = predict(model, newdata = ., type = "response")) %>%
  mutate(class_pred = (pred > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(class_pred == class))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.819
## 2 valid         0.810

It guesses correctly about 80% of the time. The data set was constructed to have exactly 50% spam and 50% ham, so a rate of 80% based on just four features is actually quite good!
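
Always guessing a single class would therefore only be right about half the time. To confirm the balance (a quick check; the two counts should come out nearly equal), we can tabulate the classes directly:

spam %>%
  group_by(class) %>%
  summarize(sm_count())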

Tokens and lemmas

We could continue to iteratively guess features to create from our data set in order to improve the model. However, it will be much better to find a way to systematically identify the best features for predicting the class of the message. To do this, we will use the cleanNLP package, which will help us work with our textual data. We start by loading the package and initializing the stringi backend (more on this next time).

library(cleanNLP)

cnlp_init_stringi()

Now, we call the function cnlp_annotate() on the spam data, extract the token component of the result (a data frame), and save it as a new variable.

token <- cnlp_annotate(spam, verbose = FALSE)$token
token
## # A tibble: 31,888 x 6
##    doc_id     sid   tid token   lemma   upos 
##  * <chr>    <int> <int> <chr>   <chr>   <chr>
##  1 doc00001     1     1 Ok      ok      X    
##  2 doc00001     1     2 not     not     X    
##  3 doc00001     1     3 a       a       X    
##  4 doc00001     1     4 problem problem X    
##  5 doc00001     1     5 will    will    X    
##  6 doc00001     1     6 get     get     X    
##  7 doc00001     1     7 them    them    X    
##  8 doc00001     1     8 a       a       X    
##  9 doc00001     1     9 taxi    taxi    X    
## 10 doc00001     1    10 .       .       PUNCT
## # … with 31,878 more rows

The annotation function creates a data frame with one row for each token (words and punctuation marks) in the data set; it has many more rows than the input data set. Each token includes an indicator of which document it came from, as well as counters for the sentence within the document and the token within the sentence. There is also a column called lemma that contains a lower-case version of the token and a column called upos that indicates whether a token is a word (“X”) or a punctuation mark (“PUNCT”).
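
As a quick check on the annotation, we can tabulate the upos column to see how the tokens split between words and punctuation marks (the two counts should sum to the number of rows in token):

token %>%
  group_by(upos) %>%
  summarize(sm_count())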

Let’s see some of the most frequent word lemmas in the data:

token %>%
  filter(upos == "X") %>%
  group_by(lemma) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 4,483 x 2
##    lemma count
##    <chr> <int>
##  1 to      838
##  2 you     519
##  3 a       466
##  4 i       359
##  5 call    328
##  6 /       322
##  7 the     317
##  8 u       293
##  9 &       289
## 10 £       288
## # … with 4,473 more rows

These are all fairly common words or symbols that we would expect to occur frequently in text messages.

Term frequency (TF) matrix

Now, we want to create a model matrix from the detected lemmas. Unlike our first attempt, which only included a few terms that we manually selected, here we will create counts for all of the terms that occur in the data. To do this, we use the function cnlp_utils_tf. It returns a matrix object with one row for each document and one column for each unique lemma in the data. We pass the full set of document ids in order to make sure that the rows of X line up with the rows of spam (there can be issues, for example, if we filter the tokens in a way that causes a document to have no tokens at all).

X <- cnlp_utils_tf(token, doc_set = spam$doc_id)
dim(X)
## [1] 1276 4489

Here, we have 1276 rows (the same as the spam data) and 4489 columns (one for each unique lemma). The matrix X is called a term frequency matrix; it provides the counts of how often each term occurs in each document. To understand the matrix, let’s look at the first few rows and columns:

X[1:10, 1:24]
## 10 x 24 sparse Matrix of class "dgCMatrix"
##          . to ! you a , call the ? i your £ u for now is and / : or - & in 2
## doc00001 3  1 .   . 2 .    .   1 . .    . . .   .   .  .   1 . .  . . .  . .
## doc00002 .  2 .   . . .    .   . 1 .    . . .   .   .  .   . . .  . 6 .  . .
## doc00003 4  . 1   . . .    1   . . 1    1 . .   1   .  .   . 2 1  . 1 .  . .
## doc00004 1  . .   1 . .    .   . . .    . . .   .   .  .   . . .  . . .  1 .
## doc00005 5  . 1   . . .    .   . . .    . 1 1   1   .  .   1 1 .  1 . .  . 1
## doc00006 7  1 .   1 . 1    .   . . .    2 . .   .   .  .   1 . 1  . 1 .  . .
## doc00007 1  3 1   2 2 .    1   . . .    1 1 .   .   .  .   . . .  . . .  . .
## doc00008 3  . .   . . 1    .   1 1 .    . . 1   .   .  .   . . .  . . .  . .
## doc00009 2  3 .   1 . .    .   1 . .    . . 1   1   .  .   . . .  . . .  1 1
## doc00010 2  1 4   . . 2    .   . . .    . 2 .   .   1  .   . . .  . . .  . .

The matrix here is slightly different from those that we built with the model.matrix function; it is a sparse matrix, a special way of storing a matrix that has a large number of zeros. The zeros are shown as dots in the printout. Reading the matrix, we see that the first two documents do not use exclamation marks, but the third one does. The first document uses three periods, the second uses none, and the third uses four.
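
The practical benefit of the sparse format is memory: only the non-zero entries and their positions are stored. As a rough illustration (exact sizes will vary with the data), we can compare the sparse matrix with a dense copy of itself:

# The dense copy stores every one of the 1276 x 4489 entries explicitly,
# zeros included; the sparse version stores only the non-zero values.
print(object.size(as.matrix(X)), units = "Mb")
print(object.size(X), units = "Mb")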

We can create the training data from the term frequency matrix using the same syntax as with a dense matrix. We will create the training response by directly grabbing the variable class from the spam data.

X_train <- X[spam$train_id == "train", ]
y_train <- spam$class[spam$train_id == "train"]
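
Before fitting anything, it is worth confirming that the rows of X line up with the rows of spam; since we passed doc_set above, the row names of the matrix should match the document ids exactly:

# This should return TRUE; if not, the features and labels are misaligned.
all(rownames(X) == spam$doc_id)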

And now we can use this data to build a model.

Penalized regression for text classification

Our model matrix has a very large number of columns. It has, in fact, more columns than rows! It is not possible to use a standard linear or logistic regression model here. However, penalized regression is perfect for this situation: it will automatically select the best variables to use in the model.

The glmnet package is able to work directly with sparse matrices, so we can run the model just as we did in the previous notebook.

model <- cv.glmnet(
  X_train, y_train, alpha = 0.2, family = "multinomial", nfolds = 3
)
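
Behind the scenes, cv.glmnet fits the model along a whole sequence of penalty values (lambda) and uses cross-validation to estimate the predictive performance of each. A standard way to inspect this is to plot the cross-validation curve:

# Cross-validation error across the lambda sequence; the vertical dashed
# lines mark the values stored as lambda.min and lambda.1se.
plot(model)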

How well does the model fit the data? It fits the training data perfectly and does much better than the previous model on the validation data:

spam %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(class == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         1    
## 2 valid         0.942

Perhaps more importantly for us, we can use the selected coefficients to find those terms most associated with spam and ham:

temp <- coef(model, s = model$lambda[16])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 10 x 2 sparse Matrix of class "dgCMatrix"
##                        0            1
## (Intercept)  0.131791062 -0.131791062
## to          -0.040254302  0.040254302
## !           -0.016507233  0.016507233
## call        -0.242306061  0.242306061
## i            0.039916172 -0.039916172
## £           -0.095063299  0.095063299
## /           -0.039815380  0.039815380
## or          -0.010101545  0.010101545
## free        -0.009491646  0.009491646
## txt         -0.054947145  0.054947145

There is only one lemma associated with ham: “i” (the personal pronoun I). The other terms are all associated with spam. Some seem intuitive, such as “£” and “free”; others are at first a bit less clear. We will see in the final sections several different ways of investigating the model and what it tells us about our data.
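
A note on the choice s = model$lambda[16] above: lower indices correspond to stronger penalties and fewer selected terms, and higher indices to weaker penalties and more terms. As a sketch (the specific indices here are illustrative), we can reuse the same pattern to count how many terms are active at a few points along the path:

# Count the terms (including the intercept) with a non-zero coefficient
# at three penalty values along the regularization path.
for (i in c(10, 16, 30)) {
  temp <- coef(model, s = model$lambda[i])
  beta <- Reduce(cbind, temp)
  print(sum(apply(beta != 0, 1, any)))
}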

Key Words in Context

The key word in context (KWiC) method is used in corpus linguistics to see how certain terms are used across a collection of texts. This can be useful for understanding why certain terms pop up in a predictive model. To run a KWiC analysis, we will use the sm_kwic function. Let’s try to figure out why the term “to” is associated with spam:

sm_kwic("to", spam$text, n = 20)
##  [1] "down things you want| to |remember later."     
##  [2] " Txt the word: CLAIM| to |No: 81010 T&C www.db"
##  [3] "ar U've been invited| to |XCHAT. This is our f"
##  [4] "u have been selected| to |receivea £900 prize "
##  [5] "                  Up| to |ü... Ü wan come then"
##  [6] "gine you finally get| to |sink into that bath "
##  [7] "k, just text SUBPOLY| to |81618, £3 per pole. "
##  [8] "              I want| to |send something that "
##  [9] "am really horny want| to |chat or see me naked"
## [10] "CTION!Txt word:START| to |No:81151 & get Yours"
## [11] "aybe even £1000 cash| to |claim ur award call "
## [12] "eekly draw txt MUSIC| to |87066 TnC www.Ldew.c"
## [13] "OUR AREA. Reply DATE| to |start now! 18 only w"
## [14] "u up at about 5.15pm| to |go to taunton if you"
## [15] " (flights inc) speak| to |a live operator 2 cl"
## [16] "OW Txt the word DRAW| to |87066 TsCs www.ldew."
## [17] "  Give her something| to |drink, if she takes "
## [18] "           Sac needs| to |carry on:)"          
## [19] ".you can use that os| to |copy the important f"
## [20] " Will Smith-Switch..| To |order follow instruc"

And we see that it is often used to indicate where a message should be sent (“Txt the word: CLAIM to No: 81010”). We can do the same with the term “or”:

sm_kwic("or", spam$text, n = 20)
##  [1] "40GB iPod MP3 player| or |a £500 prize! Txt wo"
##  [2] "uaranteed £1000 CASH| or |£2000 gift. Speak to"
##  [3] "osta Del Sol Holiday| or |£5000 await collecti"
##  [4] ")as usual vijay film| or |its different?"      
##  [5] " TheMob WAP Bookmark| or |text WAP to 82468"   
##  [6] "test Nokia 8800, PSP| or |£250 cash every wk.T"
##  [7] " u wan 2 watch today| or |tmr lor."            
##  [8] "2 receive £1000 cash| or |a 4* holiday (flight"
##  [9] "e day I can forget u| Or |The day u realize th"
## [10] "ntal Camcorder Reply| or |call 08000930705"    
## [11] "XT? Camcorder? Reply| or |call 08000930705 NOW"
## [12] "uaranteed £1000 cash| or |£5000 prize!"        
## [13] "osta Del Sol Holiday| or |£5000 await collecti"
## [14] "nc Trav, Aco&Entry41| Or |£1000. To claim txt "
## [15] "home can do my stuff| or |watch tv wat."       
## [16] "W1J6HL LDN. 18 years| or |over."               
## [17] "the next txt message| or |click here>> http://"
## [18] "eUpd8 on 08000839402| or |call2optout/!YHL"    
## [19] "y wanna c my pic?Txt| or |reply DATE to 82242 "
## [20] " 4* Tenerife Holiday| or |£10,000 cash await c"

And we see that it is used in a few common constructions, such as “Reply or call”, and to list the different types of prizes available in a contest.