Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also need to click the broom icon in the upper right-hand corner of the window. This clears any old datasets and gives us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.

Reading the Data

To start, let’s read in some text data to study for today. The dataset for this notebook consists of some spam text messages.

docs <- read_csv("../data/spam.csv")
anno <- read_csv("../data/spam_token.csv.gz")

The data is a bit old (I think around 2005) and from the UK. It is fairly easy to classify this dataset, so I think it is a good place to start.

Questions

In the code blocks below, you should type in the solution to each of the questions. Take your time with this; the code is not complex. The real point is to understand each of the steps and what we learn about the data from the model. There are also some short answers that you can fill in after the phrase “Answer”.

Start by fitting an elastic net that predicts whether a message is spam or not, and store the result in an object called model. Use all of the default values.

# Question 01
model <- dsst_enet_build(anno, docs)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

Now, compute the error rate for the training and validation sets.

# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
##   train_id   erate
##   <chr>      <dbl>
## 1 train    0.00525
## 2 valid    0.0778

Does the model do better on the training or the validation data? Answer: Better on the training data. The error rate is about 0.5% on the training set and about 7.8% on the validation set.
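
The error rate above is the mean of a logical vector that is TRUE wherever the label and the prediction disagree. As a minimal sketch of that idea, here is the same calculation in base R on a small made-up data frame (the rows in `toy` are invented for illustration and are not taken from the spam data):

```r
# Toy data, made up for illustration; this is NOT the spam dataset
toy <- data.frame(
  train_id   = c("train", "train", "train", "train", "valid", "valid", "valid", "valid"),
  label      = c("ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam"),
  pred_label = c("ham", "spam", "ham", "spam", "ham", "ham", "spam", "spam"),
  stringsAsFactors = FALSE
)

# TRUE where the prediction is wrong; mean() of a logical vector
# gives the proportion of TRUE values
toy$wrong <- toy$label != toy$pred_label

# Error rate within each split (train / valid)
erate <- tapply(toy$wrong, toy$train_id, mean)
print(erate)
```

The grouped `summarize()` call in Question 02 does the same thing, one group at a time.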

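In the next code block, compute the segmented error rate, that is, the error rate broken down by the true label (ham or spam) within each of the training and validation sets.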

# Question 03
model$docs %>%
  group_by(train_id, label) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 4 × 3
## # Groups:   train_id [2]
##   train_id label  erate
##   <chr>    <chr>  <dbl>
## 1 train    ham   0     
## 2 train    spam  0.0103
## 3 valid    ham   0.0491
## 4 valid    spam  0.108

And next, compute the confusion matrix:

# Question 04
model$docs %>%
  select(label, pred_label, train_id) %>%
  table()
## , , train_id = train
## 
##       pred_label
## label  ham spam
##   ham  373    0
##   spam   4  385
## 
## , , train_id = valid
## 
##       pred_label
## label  ham spam
##   ham  252   13
##   spam  27  222

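Is the model more likely to think that spam is ham, or ham is spam? Answer: My model was more likely to think that spam was ham. On the validation set it labeled 27 spam messages as ham, compared with 13 ham messages labeled as spam.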
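
The confusion matrix and the error rates from Questions 02 and 03 are two views of the same counts. As a quick check, the sketch below re-enters the validation counts printed above by hand and recovers the overall and per-class error rates in base R (the object names are chosen here for illustration):

```r
# Validation-set counts copied from the confusion matrix above
# (rows are the true label, columns are the predicted label)
valid <- matrix(
  c(252,  13,
     27, 222),
  nrow = 2, byrow = TRUE,
  dimnames = list(label = c("ham", "spam"), pred_label = c("ham", "spam"))
)

# Correct predictions sit on the diagonal
overall_erate <- 1 - sum(diag(valid)) / sum(valid)

# Per-class error rate: one minus the diagonal count over the row total
class_erate <- 1 - diag(valid) / rowSums(valid)

print(round(overall_erate, 4))
print(round(class_erate, 4))
```

The results should agree with the `valid` rows of the tables in Questions 02 and 03 up to rounding.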

In the next step, compute the coefficients of the model. Note that the model has fit a multinomial elastic net, but with only two classes. You should see that there are two columns with opposite signs. See question 4 on the first handout for an explanation of why this happens.

# Question 05
dsst_coef(model$model)
## 204 x 3 sparse Matrix of class "dgCMatrix"
##                                    ham          spam MLN
## (Intercept)               1.0084374502 -1.0084374502   .
## call                     -0.8231671885  0.8231671885   2
## £                        -0.7522423777  0.7522423777   5
## to                       -0.3080336861  0.3080336861  11
## !                        -0.3109799273  0.3109799273  14
## or                       -0.3646170259  0.3646170259  17
## txt                      -0.5252800368  0.5252800368  19
## free                     -0.3757629076  0.3757629076  19
## /                        -0.4221107055  0.4221107055  21
## ...                       0.1756372131 -0.1756372131  22
## -PRON-                    0.1081410116 -0.1081410116  25
## 150p                     -0.5152989181  0.5152989181  26
## text                     -0.5316047171  0.5316047171  27
## not                       0.1919928919 -0.1919928919  29
## i                         0.1591052833 -0.1591052833  31
## -                        -0.1112908507  0.1112908507  31
## service                  -0.5812371387  0.5812371387  32
## come                      0.3253504695 -0.3253504695  32
## new                      -0.9248682349  0.9248682349  33
## for                      -0.3157157047  0.3157157047  33
## ..                        0.2140391762 -0.2140391762  36
## a                        -0.1884212035  0.1884212035  40
## CALL                     -0.9403309695  0.9403309695  41
## from                     -0.1695438209  0.1695438209  43
## claim                    -0.4201778465  0.4201778465  45
## but                       0.1955530241 -0.1955530241  46
## ringtone                 -0.8026865693  0.8026865693  47
## per                      -0.2848580861  0.2848580861  47
## ok                        0.2761713748 -0.2761713748  47
## now                      -0.1569656213  0.1569656213  48
## contact                  -0.0803335088  0.0803335088  49
## later                     0.3581946904 -0.3581946904  50
## STOP                     -0.1810376489  0.1810376489  50
## :                        -0.1528404972  0.1528404972  50
## 2                        -0.1095949195  0.1095949195  50
## content                  -0.7877948922  0.7877948922  51
## message                  -0.3211211162  0.3211211162  52
## Text                     -0.2798553096  0.2798553096  55
## charge                   -0.1961128551  0.1961128551  55
## pc                       -1.5561457273  1.5561457273  56
## Txt                      -0.3755561087  0.3755561087  56
## ac                        2.8108694570 -2.8108694570  57
## happen                    1.8019297596 -1.8019297596  57
## real                     -1.2451931572  1.2451931572  57
## ago                       1.1183056337 -1.1183056337  57
## BABE                      0.7378975738 -0.7378975738  57
## CARE                      0.5291430117 -0.5291430117  57
## END                       0.3956857568 -0.3956857568  57
## fault                     0.2957462381 -0.2957462381  57
## model                     0.2519324228 -0.2519324228  57
## quit                      0.1690944459 -0.1690944459  57
## XXXX                      0.1585458243 -0.1585458243  57
## that                      0.1306502294 -0.1306502294  57
## sexy                     -0.4392750068  0.4392750068  58
## ;                         0.0358650206 -0.0358650206  58
## 4                        -0.0198986612  0.0198986612  58
## credit                   -0.6799734578  0.6799734578  59
## user                     -0.6227279358  0.6227279358  59
## shortly                  -0.5452655200  0.5452655200  59
## urgent                   -0.1424160822  0.1424160822  59
## adult                    -0.5412390937  0.5412390937  60
## some1                     1.2035479211 -1.2035479211  61
## work                      0.2273503279 -0.2273503279  61
## filthy                   -0.9747040004  0.9747040004  62
## Fri                      -0.8570074734  0.8570074734  62
## hi                       -0.2273516161  0.2273516161  62
## further                  -0.9146856144  0.9146856144  63
## 88066                    -0.5213212130  0.5213212130  63
## Calls£1                  -0.4340847396  0.4340847396  63
## landline                 -0.4040765880  0.4040765880  63
## HELP                     -0.3306199544  0.3306199544  63
## 18                       -0.2410153124  0.2410153124  63
## wait                     -0.1639412186  0.1639412186  63
## than                     -0.8568495084  0.8568495084  64
## an                       -0.1423287967  0.1423287967  64
## C                        -0.4697877313  0.4697877313  65
## apply                    -0.4443067985  0.4443067985  65
## who                      -0.2682150770  0.2682150770  65
## alone                    -0.1720596580  0.1720596580  65
## win                      -0.1661321280  0.1661321280  65
## do                        0.0475797233 -0.0475797233  65
## email                     0.4853820833 -0.4853820833  66
## FreeMsg                  -0.1278164029  0.1278164029  66
## smoke                     0.1231981109 -0.1231981109  66
## sure                      0.0842850270 -0.0842850270  66
## something                 0.2976554764 -0.2976554764  67
## eg                       -0.2903165256  0.2903165256  67
## because                   0.2987714342 -0.2987714342  68
## private                  -0.0948374504  0.0948374504  68
## August                   -0.7303431714  0.7303431714  69
## amazing                  -0.4148065712  0.4148065712  69
## blow                     -0.2354224462  0.2354224462  69
## truly                    -0.0986818735  0.0986818735  69
## 18p                      -0.0343212453  0.0343212453  69
## G                        -0.0156083207  0.0156083207  69
## incredible               -0.0056723872  0.0056723872  69
## Txts                     -0.0029722029  0.0029722029  69
## learn                    -0.0018284846  0.0018284846  69
## quality                  -0.8901071966  0.8901071966  70
## darle                    -0.7950218177  0.7950218177  70
## 2003                     -0.4854072660  0.4854072660  70
## unique                   -0.4702816606  0.4702816606  70
## Ltd                      -0.3882975934  0.3882975934  70
## yeah                      0.1172703272 -0.1172703272  70
## UK                       -0.0168007956  0.0168007956  70
## web                      -0.7055803014  0.7055803014  71
## Saturday                 -0.4135934388  0.4135934388  71
## Account                  -0.1060924471  0.1060924471  71
## at                        0.0804059794 -0.0804059794  71
## Statement                -0.0055682486  0.0055682486  71
## hav                       0.4049005596 -0.4049005596  72
## (                        -0.2294620915  0.2294620915  72
## ask                       0.1610387828 -0.1610387828  72
## da                        0.0793355313 -0.0793355313  72
## -message                 -0.3701294005  0.3701294005  73
## Reply                    -0.2047103851  0.2047103851  73
## +447797706009            -0.1564985707  0.1564985707  73
## chat                     -0.1200875849  0.1200875849  73
## Bloomberg                -0.0510654493  0.0510654493  73
## car                       0.0459093134 -0.0459093134  73
## bloomberg.com            -0.0131397142  0.0131397142  73
## http://career            -0.0057015856  0.0057015856  73
## evening                  -0.6069564283  0.6069564283  74
## bird                     -0.7586561993  0.7586561993  75
## still                     0.1181729487 -0.1181729487  75
## O2FWD                    -0.0070181842  0.0070181842  75
## More                     -0.2711644094  0.2711644094  76
## R                        -0.2013060309  0.2013060309  76
## Break                    -0.1498053219  0.1498053219  77
## center                   -0.0032982197  0.0032982197  77
## widelive.com/index       -0.1427821271  0.1427821271  78
## REAL                     -0.7663308222  0.7663308222  79
## %                        -0.3023782616  0.3023782616  79
## 5.00                     -0.1979743129  0.1979743129  79
## 83222                    -0.1309555766  0.1309555766  79
## GET                      -0.0593884083  0.0593884083  79
## our                      -0.0530656747  0.0530656747  79
## Bank                     -0.0517554691  0.0517554691  79
## in                        0.0471452491 -0.0471452491  79
## CDGT                     -0.0106478395  0.0106478395  79
## Strong                   -0.0077021589  0.0077021589  79
## explosive                -0.0027567399  0.0027567399  79
## Granite                  -0.0027421313  0.0027421313  79
## Symbol                   -0.0010990120  0.0010990120  79
## Nasdaq                   -0.0004858133  0.0004858133  79
## Gr8                      -0.3533136700  0.3533136700  80
## ring                      0.1350919599 -0.1350919599  80
## press                    -0.1291756232  0.1291756232  80
## 0                        -0.0455911763  0.0455911763  80
## result                   -0.0116246297  0.0116246297  80
## Unsubscribe              -0.0103356007  0.0103356007  80
## EURO                     -0.0067350131  0.0067350131  80
## Arcade                   -0.0067061587  0.0067061587  80
## inform                   -0.0026194793  0.0026194793  80
## euro2004                 -0.0020511824  0.0020511824  80
## kickoff                  -0.0008069822  0.0008069822  80
## within                   -0.1448261898  0.1448261898  81
## boss                      0.1292309713 -0.1292309713  81
## Camera                   -0.0624063695  0.0624063695  81
## Digital                  -0.0192030565  0.0192030565  81
## SiPix                    -0.0067517645  0.0067517645  81
## student                   0.2160818661 -0.2160818661  82
## top                      -0.1625630261  0.1625630261  82
## smile                     0.0801039352 -0.0801039352  83
## news                     -0.0584177422  0.0584177422  83
## 62468                    -0.1330209665  0.1330209665  84
## FRND                     -0.0382388180  0.0382388180  84
## able                      0.1656260743 -0.1656260743  85
## part                     -0.1243590703  0.1243590703  85
## 8007                     -0.0770602477  0.0770602477  85
## turn                      0.0500851918 -0.0500851918  85
## when                      0.0229457381 -0.0229457381  85
## purchase                 -0.2185626139  0.2185626139  86
## log                      -0.1515582780  0.1515582780  86
## http://www.urawinner.com -0.0142195548  0.0142195548  86
## your                     -0.1028154054  0.1028154054  87
## shop                      0.0670019543 -0.0670019543  87
## Latest                   -0.0826997442  0.0826997442  89
## dad                       0.0819194698 -0.0819194698  89
## IM                        0.0810717166 -0.0810717166  89
## follow                   -0.0650273962  0.0650273962  89
## 2005                     -0.0951562895  0.0951562895  90
## POLY                     -0.0380859820  0.0380859820  90
## send                     -0.0182862498  0.0182862498  90
## mail                      0.0615440574 -0.0615440574  91
## News                     -0.1294544173  0.1294544173  92
## 542                      -0.0926237415  0.0926237415  92
## mrng                      0.0603571538 -0.0603571538  92
## :)                        0.0088514800 -0.0088514800  92
## hint                      0.0222340757 -0.0222340757  93
## file                      0.0164144008 -0.0164144008  93
## this                     -0.0110794815  0.0110794815  93
## then                      0.0003830135 -0.0003830135  93
## lady                     -0.0506076020  0.0506076020  94
## career                   -0.0258725840  0.0258725840  94
## hw                        0.0127901525 -0.0127901525  94
## star                     -0.0023686148  0.0023686148  94
## find                     -0.0007062404  0.0007062404  94
## partner                  -0.0130315931  0.0130315931  95
## cause                     0.0141779851 -0.0141779851  96
## voicemail                -0.0104162535  0.0104162535  96
## tv                       -0.0055463087  0.0055463087  96
## chart                    -0.0052122189  0.0052122189  96
## subscriber               -0.0005832920  0.0005832920  96

Look at the first few values in the table above. Do the strongest features that the model found make sense to you as an indicator of spam? Answer: I found the strongest features to be “call”, “£”, “to” and “!”. All the signs are positive for spam. I can certainly imagine all of these featuring in a spam text message.
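
Because the ham and spam columns are mirror images, the two-class multinomial model carries the same information as an ordinary logistic regression on the difference of the two scores. The sketch below illustrates this with the intercept and the “call” coefficient from the table. It assumes that the feature is the number of times “call” appears in a message and that no other term in the model occurs; both are simplifications made here for illustration, not a description of how the course package builds its features.

```r
# Coefficients for the spam class, rounded from the table above
b0_spam <- -1.0084   # (Intercept)
b_call  <-  0.8232   # "call"

# Hypothetical message: ASSUME the feature is the count of "call" and
# that no other term from the model appears in the message
n_call <- 2

score_spam <- b0_spam + b_call * n_call
score_ham  <- -score_spam   # the ham column is the mirror image

# Two-class softmax over the pair of scores
p_softmax <- exp(score_spam) / (exp(score_spam) + exp(score_ham))

# Equivalent logistic form: the difference between the scores is 2 * score_spam
p_logistic <- plogis(2 * score_spam)

print(c(softmax = p_softmax, logistic = p_logistic))
```

The two printed values match, which is why only the sign and size of one column need to be read when interpreting the table.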

Reproduce the coefficients, cutting off after a particular lambda number to get only a small number of results (about a dozen is good).

# Question 06
dsst_coef(model$model, lambda_num = 27)
## 13 x 3 sparse Matrix of class "dgCMatrix"
##                      ham         spam MLN
## (Intercept)  0.249848248 -0.249848248   .
## call        -0.408994453  0.408994453   2
## £           -0.243674928  0.243674928   5
## to          -0.099597910  0.099597910  11
## !           -0.076051815  0.076051815  14
## or          -0.084007961  0.084007961  17
## txt         -0.110410147  0.110410147  19
## free        -0.078472933  0.078472933  19
## /           -0.054961656  0.054961656  21
## ...          0.038612052 -0.038612052  22
## -PRON-       0.005298340 -0.005298340  25
## 150p        -0.022383909  0.022383909  26
## text        -0.003014019  0.003014019  27

Next, print out 10 messages that have the highest probability of being spam. Use the function dsst_print_text() to print out the messages and take a moment to read them.

# Question 07
model$docs %>%
  filter(pred_label == "spam") %>%
  arrange(desc(pred_value))  %>%
  slice_head(n = 10) %>%
  dsst_print_text()
## doc00202; spam; valid; spam; 0.999992138321575
## Ur ringtone service has changed! 25 Free credits! Go to
## club4mobiles.com to choose content now! Stop? txt CLUB STOP to 87070.
## 150p/wk Club4 PO Box1146 MK45 2WT
## 
## doc00855; spam; train; spam; 0.999982353729283
## Free-message: Jamster!Get the crazy frog sound now! For poly text
## MAD1, for real text MAD2 to 88888. 6 crazy sounds for just 3 GBP/week!
## 16+only! T&C's apply
## 
## doc00661; spam; valid; spam; 0.999889171471002
## Had your mobile 10 mths? Update to latest Orange camera/video phones
## for FREE. Save £s with Free texts/weekend calls. Text YES for a
## callback orno to opt out
## 
## doc00214; spam; valid; spam; 0.999833415323564
## January Male Sale! Hot Gay chat now cheaper, call 08709222922.
## National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call
## 08712460324 (10p/min)
## 
## doc00250; spam; valid; spam; 0.999652384688644
## Freemsg: 1-month unlimited free calls! Activate SmartCall Txt: CALL to
## No: 68866. Subscriptn3gbp/wk unlimited calls Help: 08448714184 Stop?
## txt stop landlineonly
## 
## doc00057; spam; train; spam; 0.999578061304966
## I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt
## WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea.
## OTBox 731 LA1 7WS. )
## 
## doc00710; spam; train; spam; 0.999548735734572
## 3 FREE TAROT TEXTS! Find out about your love life now! TRY 3 FOR FREE!
## Text CHANCE to 85555 16 only! After 3 Free, Msgs £1.50 each
## 
## doc00830; spam; train; spam; 0.99949402694429
## Well done ENGLAND! Get the official poly ringtone or colour flag on
## yer mobile! text TONE or FLAG to 84199 NOW! Opt-out txt ENG STOP.
## Box39822 W111WX £1.50
## 
## doc01134; spam; train; spam; 0.99949402694429
## Well done ENGLAND! Get the official poly ringtone or colour flag on
## yer mobile! text TONE or FLAG to 84199 NOW! Opt-out txt ENG STOP.
## Box39822 W111WX £1.50
## 
## doc01099; spam; valid; spam; 0.99939969586009
## Great News! Call FREEFONE 08006344447 to claim your guaranteed £1000
## CASH or £2000 gift. Speak to a live operator NOW!
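
The pattern in Question 07 is filter, sort, then keep the top rows. Judging from the output, pred_value is the probability of whichever class was predicted, which is why the filter on pred_label comes first. A minimal base R sketch of the same three steps on made-up values (the data frame `toy` is hypothetical):

```r
# Toy predictions, made up for illustration
toy <- data.frame(
  doc_id     = c("d1", "d2", "d3", "d4", "d5"),
  pred_label = c("spam", "ham", "spam", "ham", "spam"),
  pred_value = c(0.91, 0.99, 0.55, 0.60, 0.97),
  stringsAsFactors = FALSE
)

# Step 1: keep only the messages predicted to be spam
spam_only <- toy[toy$pred_label == "spam", ]

# Step 2: sort from most to least confident
ranked <- spam_only[order(spam_only$pred_value, decreasing = TRUE), ]

# Step 3: keep the top rows (2 here; the question asks for 10)
top2 <- head(ranked, n = 2)
print(top2)
```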

Repeat the last question, but select the ten messages most likely to be ham.

# Question 08
model$docs %>%
  filter(pred_label == "ham") %>%
  arrange(desc(pred_value))  %>%
  slice_head(n = 10) %>%
  dsst_print_text()
## doc00734; ham; train; ham; 0.999995428135907
## The last thing i ever wanted to do was hurt you. And i didn't think it
## would have. You'd laugh, be embarassed, delete the tag and keep going.
## But as far as i knew, it wasn't even up. The fact that you even felt
## like i would do it to hurt you shows you really don't know me at all.
## It was messy wednesday, but it wasn't bad. The problem i have with it
## is you HAVE the time to clean it, but you choose not to. You skype,
## you take pictures, you sleep, you want to go out. I don't mind a few
## things here
## 
## doc00226; spam; valid; ham; 0.999615059207897
## SMS. ac JSco: Energy is high, but u may not know where 2channel it.
## 2day ur leadership skills r strong. Psychic? Reply ANS w/question.
## End? Reply END JSCO
## 
## doc00025; spam; valid; ham; 0.998646488829747
## SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice
## Hockey. Correct or Incorrect? End? Reply END SPTV
## 
## doc00987; ham; train; ham; 0.998102983123443
## Wow. I never realized that you were so embarassed by your
## accomodations. I thought you liked it, since i was doing the best i
## could and you always seemed so happy about "the cave". I'm sorry I
## didn't and don't have more to give. I'm sorry i offered. I'm sorry
## your room was so embarassing.
## 
## doc00476; ham; valid; ham; 0.997655412518001
## Are you angry with me. What happen dear
## 
## doc01252; ham; valid; ham; 0.997655412518001
## I'm aight. Wat's happening on your side.
## 
## doc01035; ham; valid; ham; 0.995760396588148
## Probably money worries. Things are coming due and i have several
## outstanding invoices for work i did two and three months ago.
## 
## doc00574; ham; train; ham; 0.992649102839132
## Okay lor... Wah... like that def they wont let us go... Haha... What
## did they say in the terms and conditions?
## 
## doc00519; ham; valid; ham; 0.991784016399156
## Okay lor... Will they still let us go a not ah? Coz they will not know
## until later. We drop our cards into the box right?
## 
## doc00495; ham; train; ham; 0.991052631029984
## No dude, its not fake..my frnds got money, thts y i'm reffering u..if
## u member wit my mail link, u vl be credited &lt;#&gt; rs and il be
## getiing &lt;#&gt; rs..i can draw my acc wen it is &lt;#&gt; rs..

Next, print out ten random spam messages that were misclassified.

# Question 09
model$docs %>%
  filter(label == "spam") %>%
  filter(label != pred_label) %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc00703; spam; valid; ham; 0.515106762048036
## Get 3 Lions England tone, reply lionm 4 mono or lionp 4 poly. 4 more
## go 2 www.ringtones.co.uk, the original n best. Tones 3GBP network
## operator rates apply.
## 
## doc00336; spam; valid; ham; 0.805331623254029
## A link to your picture has been sent. You can also use http://
## alto18.co.uk/wave/wave.asp?o=44345
## 
## doc00801; spam; valid; ham; 0.903187750837926
## Check Out Choose Your Babe Videos @ sms.shsex.netUN fgkslpoPW fgkslpo
## 
## doc00475; spam; train; ham; 0.5331847877818
## Your credits have been topped up for http://www.bubbletext.com Your
## renewal Pin is tgxxrz
## 
## doc01108; spam; valid; ham; 0.707242898252999
## tddnewsletter@emc1.co.uk (More games from TheDailyDraw) Dear Helen,
## Dozens of Free Games - with great prizesWith..
## 
## doc00887; spam; valid; ham; 0.705545996909835
## 3. You have received your mobile content. Enjoy
## 
## doc00617; spam; valid; ham; 0.515106762048036
## Get 3 Lions England tone, reply lionm 4 mono or lionp 4 poly. 4 more
## go 2 www.ringtones.co.uk, the original n best. Tones 3GBP network
## operator rates apply
## 
## doc01034; spam; train; ham; 0.882557480301154
## ringtoneking 84484
## 
## doc00066; spam; valid; ham; 0.821765526679288
## You will recieve your tone within the next 24hrs. For Terms and
## conditions please see Channel U Teletext Pg 750
## 
## doc00265; spam; valid; ham; 0.582015539203516
## Want explicit SEX in 30 secs? Ring 02073162414 now! Costs 20p/min

Now, print out ten random ham messages that were misclassified.

# Question 10
model$docs %>%
  filter(label == "ham") %>%
  filter(label != pred_label) %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc00113; ham; valid; spam; 0.877789205952256
## Oh thats late! Well have a good night and i will give u a call
## tomorrow. Iam now going to go to sleep night night
## 
## doc00764; ham; valid; spam; 0.641227009753046
## I‘ll have a look at the frying pan in case it‘s cheap or a book
## perhaps. No that‘s silly a frying pan isn‘t likely to be a book
## 
## doc01033; ham; valid; spam; 0.581212523413683
## Nimbomsons. Yep phone knows that one. Obviously, cos thats a real word
## 
## doc00733; ham; valid; spam; 0.541082987032844
## Sir Goodmorning, Once free call me.
## 
## doc00928; ham; valid; spam; 0.864890363509263
## Hey there! Glad u r better now. I hear u treated urself to a digi cam,
## is it good? We r off at 9pm. Have a fab new year, c u in coupla wks!
## 
## doc01127; ham; valid; spam; 0.611784957326342
## Happy new years melody!
## 
## doc01256; ham; valid; spam; 0.559170267274495
## My slave! I want you to take 2 or 3 pictures of yourself today in
## bright light on your cell phone! Bright light!
## 
## doc01037; ham; valid; spam; 0.614985394011444
## How's it feel? Mr. Your not my real Valentine just my yo Valentine
## even tho u hardly play!!
## 
## doc00366; ham; valid; spam; 0.590642494406985
## Life spend with someone for a lifetime may be meaningless but a few
## moments spent with someone who really love you means more than life
## itself..
## 
## doc00221; ham; valid; spam; 0.73851105629729
## Hi its Kate how is your evening? I hope i can see you tomorrow for a
## bit but i have to bloody babyjontet! Txt back if u can. :) xxx
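
Questions 09 and 10 draw a random sample, so the messages printed will usually differ from one run to the next. If a repeatable sample is wanted, one option is to set a seed before sampling. The sketch below shows the idea in base R on a made-up data frame; the seed value and the rows in `toy` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, used only so the sample is repeatable

# Toy labels and predictions, made up for illustration
toy <- data.frame(
  doc_id     = sprintf("d%02d", 1:8),
  label      = c("spam", "spam", "spam", "ham", "ham", "spam", "ham", "spam"),
  pred_label = c("ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam"),
  stringsAsFactors = FALSE
)

# Spam messages that the model labeled as ham
missed_spam <- toy[toy$label == "spam" & toy$label != toy$pred_label, ]

# Draw 2 of them at random, without replacement
picked <- missed_spam[sample(nrow(missed_spam), size = 2), ]
print(picked)
```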