Email Spam

In this lab we will look at a different spam data set, this time using email rather than SMS records. Note that we avoid printing the entire data set because RStudio can struggle to display large data sets that contain lengthy text fields.

set.seed(1)

email <- read_csv(file.path("data", "spam_email.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
head(email)
## # A tibble: 6 x 4
##   doc_id   class text                                                   train_id
##   <chr>    <dbl> <chr>                                                  <chr>   
## 1 doc00001     1 "This message is in MIME format. Since your mail read… valid   
## 2 doc00002     0 "URL: http://boingboing.net/#85511996 Date: Not suppl… valid   
## 3 doc00003     0 "Vernon,  I'm changing the instructions in the SpamAs… train   
## 4 doc00004     0 "URL: http://www.newsisfree.com/click/-3,8456083,215/… train   
## 5 doc00005     1 "Astounding New Software Lets You Find Out Almost Any… train   
## 6 doc00006     1 "Accept Credit Cards - Everyone Approved NO CREDIT CH… valid
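As a sanity check after creating the split, it can help to confirm the train/validation proportions and the spam rate within each split. A minimal sketch, reusing the dplyr verbs already loaded for this lab:

```r
# Tabulate the split sizes and the proportion of spam in each split;
# with runif(n()) < 0.6 we expect roughly 60% of rows in "train"
email %>%
  group_by(train_id) %>%
  summarize(n = n(), prop_spam = mean(class))
```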

Building Manual Features

To start, let’s try to build manual features to predict whether a message is spam. Build a logistic regression model using the length of the message and 3-5 hand constructed features (specific punctuation marks or words). Print out a summary of the model.

model <- email %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_commas = stri_count(text, fixed = ",")
  ) %>%
  filter(train_id == "train") %>%
  glm(
    class ~ length + num_exclam + num_quest + num_commas,
    data = .,
    family = binomial()
  )

summary(model)
## 
## Call:
## glm(formula = class ~ length + num_exclam + num_quest + num_commas, 
##     family = binomial(), data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8018  -0.8535  -0.1636   0.8623   3.2983  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.690e-01  8.058e-02  -9.543  < 2e-16 ***
## length       3.904e-04  5.349e-05   7.298 2.92e-13 ***
## num_exclam   5.485e-01  3.574e-02  15.347  < 2e-16 ***
## num_quest   -2.492e-01  3.323e-02  -7.499 6.44e-14 ***
## num_commas  -5.618e-02  7.671e-03  -7.324 2.41e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2554.9  on 1842  degrees of freedom
## Residual deviance: 1875.2  on 1838  degrees of freedom
## AIC: 1885.2
## 
## Number of Fisher Scoring iterations: 7

According to the model, all other things being equal, are longer messages more or less likely to be spam? Answer: The coefficient on length is positive (3.904e-04), so, holding the other features fixed, longer messages are more likely to be spam.
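Because logistic regression coefficients are on the log-odds scale, exponentiating them can make the direction and size of each effect easier to read. A short sketch (the exact values depend on your fitted model):

```r
# Convert log-odds coefficients to odds ratios: values above 1 increase
# the odds of spam per unit change; values below 1 decrease them
exp(coef(model))
```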

Now, compute the classification rate on the training and validation sets:

email %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_commas = stri_count(text, fixed = ",")
  ) %>%
  mutate(pred = predict(model, newdata = ., type = "response")) %>%
  mutate(class_pred = (pred > 0.5)) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(class_pred == class))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.768
## 2 valid         0.758

Take note of how this compares to the classification rate in the notes from today.
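Beyond the overall classification rate, a confusion table on the validation set shows which kinds of errors the model makes (spam flagged as ham versus ham flagged as spam). A sketch reusing the feature construction above:

```r
# Hypothetical follow-up: cross-tabulate actual vs. predicted class on
# the validation set only
email %>%
  mutate(
    length = stri_length(text),
    num_exclam = stri_count(text, fixed = "!"),
    num_quest = stri_count(text, fixed = "?"),
    num_commas = stri_count(text, fixed = ",")
  ) %>%
  mutate(pred = as.numeric(predict(model, newdata = ., type = "response") > 0.5)) %>%
  filter(train_id == "valid") %>%
  select(class, pred) %>%
  table()
```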

Building Automatic Features

Now, we will produce a tokenized version of the data set. The code below prints out progress every 500 documents. As with the email data set, we will try to avoid printing out the entire token table.

cnlp_init_stringi()
token <- cnlp_annotate(email, verbose = 500)$token
## Processed document 500 of 3000
## Processed document 1000 of 3000
## Processed document 1500 of 3000
## Processed document 2000 of 3000
## Processed document 2500 of 3000
## Processed document 3000 of 3000
head(token)
## # A tibble: 6 x 6
##   doc_id     sid   tid token   lemma   upos 
##   <chr>    <int> <int> <chr>   <chr>   <chr>
## 1 doc00001     1     1 This    this    X    
## 2 doc00001     1     2 message message X    
## 3 doc00001     1     3 is      is      X    
## 4 doc00001     1     4 in      in      X    
## 5 doc00001     1     5 MIME    mime    X    
## 6 doc00001     1     6 format  format  X

Determine the most common non-punctuation lemmas in the data set by counting the occurrences of every lemma and sorting the counts in descending order.

token %>%
  filter(upos == "X") %>%
  group_by(lemma) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 73,848 x 2
##    lemma count
##    <chr> <int>
##  1 -     90611
##  2 =     71367
##  3 /     41588
##  4 the   30012
##  5 to    23832
##  6 and   16901
##  7 >     16418
##  8 +     15857
##  9 of    15115
## 10 a     14803
## # … with 73,838 more rows

Notice that some punctuation marks still appear in the results even though we filtered on the upos tag. This is because the stringi parser does not assign accurate part-of-speech tags; we will see a more accurate parser next class.
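Since the stringi parser's part-of-speech tags are unreliable, one workaround is to drop punctuation-only lemmas directly with a regular expression. A sketch under that assumption, using the stringi package already loaded in this lab:

```r
# Hypothetical workaround: keep only lemmas containing at least one
# letter, which removes tokens made purely of punctuation or symbols
token %>%
  filter(stri_detect(lemma, regex = "[a-z]")) %>%
  group_by(lemma) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
```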

Now, build a TF matrix from the data set using the default parameters of the cnlp_utils_tf function. Also create a training version of the matrix and a training response vector. Print out the dimensions of the data matrix.

X <- cnlp_utils_tf(token, doc_set = email$doc_id)
X_train <- X[email$train_id == "train", ]
y_train <- email$class[email$train_id == "train"]
dim(X)
## [1]  3000 10000

How many features have been created? Answer: 10000
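The term-frequency matrix is stored in a sparse format, so it can be informative to see how sparse it actually is before modeling. A sketch, assuming X is a standard sparse Matrix object as returned by cnlp_utils_tf:

```r
# Fraction of entries in the TF matrix that are non-zero; for text data
# this is typically a very small number
Matrix::nnzero(X) / (nrow(X) * ncol(X))
```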

Now, create an elastic net model using three folds, alpha 0.9, and the TF matrix created above.

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, family = "multinomial", nfolds = 3
)

Using this model, compute the classification rate for the training and validation sets in the email data.

email %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(class == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.997
## 2 valid         0.967

How does this compare with the hand-constructed features? Answer: The model has an almost perfect fit for the training set and a very high accuracy for the validation data (nearly 97%).

Look at the coefficients from the model, selecting a lambda so that about twelve variables are selected.

temp <- coef(model, s = model$lambda[18])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 13 x 2 sparse Matrix of class "dgCMatrix"
##                        0            1
## (Intercept) -0.005333324  0.005333324
## your        -0.003911003  0.003911003
## !           -0.004777495  0.004777495
## we          -0.009029468  0.009029468
## our         -0.014677200  0.014677200
## listinfo     0.012844803 -0.012844803
## but          0.003336745 -0.003336745
## mailman      0.137568961 -0.137568961
## wrote        0.395720853 -0.395720853
## removed     -0.263411590  0.263411590
## remove      -0.067918425  0.067918425
## date         0.029991160 -0.029991160
## sightings   -0.008249627  0.008249627

Do these features and signs make sense / seem reasonable to you? Are any surprising? Answer: Answers vary.
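If the lambda index used above (here, 18) selects too many or too few variables for your fit, you can scan the number of selected variables along the whole regularization path and pick an index that matches your target. A sketch reusing the coefficient-extraction pattern above:

```r
# For each lambda in the path, count features (excluding the intercept)
# with a non-zero coefficient in any class
sapply(seq_along(model$lambda), function(i) {
  b <- Reduce(cbind, coef(model, s = model$lambda[i]))
  sum(apply(b[-1, , drop = FALSE] != 0, 1, any))
})
```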

Key Words in Context (KWiC)

In the four code blocks below, use the sm_kwic function to look at 20 occurrences of the more surprising/interesting terms you found above.

sm_kwic("mailman", email$text, n = 20)
##  [1] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
##  [2] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
##  [3] " http://www.linux.ie|/mailman/|listinfo/ilug for (u"
##  [4] " http://www.linux.ie|/mailman/|listinfo/ilug for (u"
##  [5] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
##  [6] "7--  http://xent.com|/mailman/|listinfo/fork"       
##  [7] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
##  [8] ".us  http://xent.com|/mailman/|listinfo/fork"       
##  [9] " http://www.linux.ie|/mailman/|listinfo/ilug for (u"
## [10] "//listman.redhat.com|/mailman/|listinfo/exmh-users" 
## [11] " http://www.linux.ie|/mailman/|listinfo/social for "
## [12] "es.  http://xent.com|/mailman/|listinfo/fork"       
## [13] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
## [14] "http://iiu.taint.org|/mailman/|listinfo/iiu"        
## [15] "om   http://xent.com|/mailman/|listinfo/fork"       
## [16] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
## [17] "/lists.freshrpms.net|/mailman/|listinfo/rpm-list"   
## [18] "//listman.redhat.com|/mailman/|listinfo/exmh-worker"
## [19] "//listman.redhat.com|/mailman/|listinfo/exmh-users "
## [20] "an.  http://xent.com|/mailman/|listinfo/fork"
sm_kwic("wrote", email$text, n = 20)
##  [1] "       Urban Boquist| wrote:| > If I run spamassa"  
##  [2] " \"Ted\" == Ted Cabeen| wrote:| > >  Ted> Here's th"
##  [3] "at, 27 Jul 2002, Tom| wrote:| >  > >  > > Off we "  
##  [4] "07 -0700 Ben Liblit | wrote:|  > Ick.  Perhaps th"  
##  [5] "        Wynne, Conor| wrote:| > Hi ladies, >  > I"  
##  [6] " +0100, Liam Bedford| wrote:| >fdisk /mbr will re"  
##  [7] "= Brent Welch  >>>>>| wrote |the following on Tue"  
##  [8] " at 19:26, Rick Moen| wrote:| > >> NON-freely red"  
##  [9] "+0100, Padraig Brady| wrote:| > > On Fri, Aug 16,"  
## [10] "tbitch@magnesium.net| wrote:| > > > > Hit or miss"  
## [11] "Gary Lawrence Murphy| wrote |>Wasn't the Aztec po"  
## [12] "01:58, Michael Conry| wrote:| > Hi all, > I've ru"  
## [13] "-07-22 at 06:20, che| wrote:|  > The server menti"  
## [14] "             SoloCDM| wrote:| > David Neary state"  
## [15] "Thu, 1 Aug 2002, Tom| wrote:|  >  > I gota thank "  
## [16] " >  > Adam L. Beberg| wrote:| >  >>> Forwarding m"  
## [17] " -0500, Brian French| wrote:| > hey i have a prob"  
## [18] "-0400  Tom Reingold | wrote:|  > Years ago, there"  
## [19] "PM +0200, Axel Thimm| wrote:| > On Thu, Sep 05, 2"  
## [20] "ng: 7bit  Owen Byrne| wrote:|  > R. A. Hettinga w"
sm_kwic("removed", email$text, n = 20)
##  [1] "(949) 218-6189 to be| removed |from = future mailin"
##  [2] "ou would like to be | removed | from this list send"
##  [3] "     HERE&nbsp;to be| removed |from the list. Pleas"
##  [4] "professionals. To be| removed |from this mailing li"
##  [5] "            What    ||Removed |                    "
##  [6] "       request to be| removed |by clicking HERE    "
##  [7] "uld would like to be| removed |from our list, pleas"
##  [8] " contact you.  To be| removed |from our link simple"
##  [9] " 7        days to be| removed |and send ALL address"
## [10] "e allow 7 days to be| removed |and send ALL add= re"
## [11] "netcommission. To Be| Removed |From Our List, http:"
## [12] "ah.com:27000   To be| removed |from our list simply"
## [13] "error, or wish to be| removed |from our subscribe= "
## [14] "ntended.       To be| removed |from our distributio"
## [15] "e used again.  To be| removed |from the mailing lis"
## [16] "   =20         To be| removed |from this list click"
## [17] "e              to be| removed |from our email list."
## [18] "    Click Here to be| removed,|=  and you will *nev"
## [19] "professionals. To be| removed |from this mailing li"
## [20] "               To Be| Removed |From Our Lis= t,  CL"
sm_kwic("date", email$text, n = 20)
##  [1] "ick/-2,8443955,1440/| Date:| Not supplied  A new" 
##  [2] "usiness for Grownups| Date:| Sun, 21 Jan 2001 09" 
##  [3] "spam corpus. >> >>To| date,| the scores evolved " 
##  [4] "  Anders Eriksson  >| Date:|  Mon, 19 Aug 2002 2" 
##  [5] " each (our choice of| date,| but we = will pick " 
##  [6] "c27a6995e30fc10b6482| Date:| Not supplied  It ca" 
##  [7] "lick/-0,8357899,215/| Date:| 2002-09-29T00:39:55" 
##  [8] "lick/-0,8597657,215/| Date:| 2002-10-06T02:27:58" 
##  [9] "HOURS       FROM THE| DATE |OF DELIVERY OF THIS " 
## [10] "rt Elz wrote:  >    | Date:|        Wed, 11 Sep " 
## [11] "ick/-1,8643939,1440/| Date:| Not supplied  World" 
## [12] "com for a more up to| date |online copy of the p" 
## [13] "ick/-1,8396718,1717/| Date:| 2002-09-30T15:14:16" 
## [14] ">by both last change| date |and alphabetically f" 
## [15] "com/click/215,3,215/| Date:| 2002-10-03T04:21:04" 
## [16] "ick/-0,8132228,1440/| Date:| Not supplied  The c" 
## [17] "n.taint.org Delivery|-Date:| Tue, 02 Oct 2001 15" 
## [18] "sw.com/weblog/000614| Date:| 2002-09-24T11:03:09" 
## [19] "J. W. Ballantine\"  >| Date:|  Wed, 21 Aug 2002 0"
## [20] "ick/-3,8473772,1440/| Date:| Not supplied  And s"

Does the KWiC method help explain why some of these are so predictive and/or why they are associated with a particular category? Answer: Answers vary.