Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.
To start, let’s read in some text data to study today. The dataset for this notebook consists of text messages, each labeled as either spam or ham.
docs <- read_csv("../data/spam.csv")
anno <- read_csv("../data/spam_token.csv.gz")
The data is a bit old (I think around 2005) and from the UK. It is fairly easy to classify this dataset, so I think it is a good place to start.
In the code blocks below, you should type in the solution to each of the questions. Take your time with this; the code is not complex. The real point is to understand each of the steps and what we learn about the data from the model. There are also some short answers that you can fill in after the phrase “Answer”.
Start by creating an object called model that builds an elastic net to predict whether a message is spam or not. Use all of the default values.
# Question 01
model <- dsst_enet_build(anno, docs)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
Now, compute the error rate for the training and validation sets.
# Question 02
model$docs %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 2 × 2
## train_id erate
## <chr> <dbl>
## 1 train 0.00525
## 2 valid 0.0778
Does the model do better on the training or the validation data? Answer: Better on the training data.
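The error rate computed above is simply the share of rows where the true and predicted labels differ. Taking `mean()` of a logical vector gives that share directly, because TRUE counts as 1 and FALSE as 0. A minimal sketch on a six-row table (the rows are invented for illustration and do not come from the spam data):

```r
library(dplyr)

# Hypothetical toy predictions, invented for illustration only.
toy <- tibble(
  train_id   = c("train", "train", "train", "valid", "valid", "valid"),
  label      = c("ham", "spam", "spam", "ham", "spam", "ham"),
  pred_label = c("ham", "spam", "spam", "spam", "spam", "ham")
)

# label != pred_label is TRUE for each mistake; its mean is the error rate.
toy_erate <- toy %>%
  group_by(train_id) %>%
  summarize(erate = mean(label != pred_label))

toy_erate
```

Here the toy training rows contain no mistakes and the toy validation rows contain one mistake out of three.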
In the next code block, compute the segmented error rate.
# Question 03
model$docs %>%
  group_by(train_id, label) %>%
  summarize(erate = mean(label != pred_label))
## # A tibble: 4 × 3
## # Groups: train_id [2]
## train_id label erate
## <chr> <chr> <dbl>
## 1 train ham 0
## 2 train spam 0.0103
## 3 valid ham 0.0491
## 4 valid spam 0.108
And next, compute the confusion matrix:
# Question 04
model$docs %>%
  select(label, pred_label, train_id) %>%
  table()
## , , train_id = train
##
## pred_label
## label ham spam
## ham 373 0
## spam 4 385
##
## , , train_id = valid
##
## pred_label
## label ham spam
## ham 252 13
## spam 27 222
Is the model more likely to think that spam is ham, or ham is spam? Answer: My model was more likely to think that spam was ham.
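The confusion matrix carries all the information needed to recover the earlier error rates, which makes for a useful consistency check. The sketch below retypes the counts printed above by hand; the helper functions `overall_erate()` and `segmented_erate()` are names made up for this example and are not part of the course package.

```r
# Counts copied from the confusion matrices above
# (rows: true label, columns: predicted label).
train <- matrix(c(373, 0, 4, 385), nrow = 2, byrow = TRUE,
                dimnames = list(label = c("ham", "spam"),
                                pred_label = c("ham", "spam")))
valid <- matrix(c(252, 13, 27, 222), nrow = 2, byrow = TRUE,
                dimnames = list(label = c("ham", "spam"),
                                pred_label = c("ham", "spam")))

# Overall error rate: off-diagonal counts divided by the total.
overall_erate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Segmented error rate: within each true label, the share predicted wrongly.
segmented_erate <- function(cm) 1 - diag(cm) / rowSums(cm)

overall_erate(train)
overall_erate(valid)
segmented_erate(train)
segmented_erate(valid)
```

These values should agree, up to rounding, with the tables from Questions 02 and 03. In the validation set, 27 spam messages were predicted as ham while 13 ham messages were predicted as spam, which is the basis for the answer above.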
In the next step, compute the coefficients of the model. Note that the model has fit a multinomial elastic net, but with only two classes. You should see that there are two columns with opposite signs. See question 4 on the first handout for an explanation of why this happens.
# Question 05
dsst_coef(model$model)
## 204 x 3 sparse Matrix of class "dgCMatrix"
## ham spam MLN
## (Intercept) 1.0084374502 -1.0084374502 .
## call -0.8231671885 0.8231671885 2
## £ -0.7522423777 0.7522423777 5
## to -0.3080336861 0.3080336861 11
## ! -0.3109799273 0.3109799273 14
## or -0.3646170259 0.3646170259 17
## txt -0.5252800368 0.5252800368 19
## free -0.3757629076 0.3757629076 19
## / -0.4221107055 0.4221107055 21
## ... 0.1756372131 -0.1756372131 22
## -PRON- 0.1081410116 -0.1081410116 25
## 150p -0.5152989181 0.5152989181 26
## text -0.5316047171 0.5316047171 27
## not 0.1919928919 -0.1919928919 29
## i 0.1591052833 -0.1591052833 31
## - -0.1112908507 0.1112908507 31
## service -0.5812371387 0.5812371387 32
## come 0.3253504695 -0.3253504695 32
## new -0.9248682349 0.9248682349 33
## for -0.3157157047 0.3157157047 33
## .. 0.2140391762 -0.2140391762 36
## a -0.1884212035 0.1884212035 40
## CALL -0.9403309695 0.9403309695 41
## from -0.1695438209 0.1695438209 43
## claim -0.4201778465 0.4201778465 45
## but 0.1955530241 -0.1955530241 46
## ringtone -0.8026865693 0.8026865693 47
## per -0.2848580861 0.2848580861 47
## ok 0.2761713748 -0.2761713748 47
## now -0.1569656213 0.1569656213 48
## contact -0.0803335088 0.0803335088 49
## later 0.3581946904 -0.3581946904 50
## STOP -0.1810376489 0.1810376489 50
## : -0.1528404972 0.1528404972 50
## 2 -0.1095949195 0.1095949195 50
## content -0.7877948922 0.7877948922 51
## message -0.3211211162 0.3211211162 52
## Text -0.2798553096 0.2798553096 55
## charge -0.1961128551 0.1961128551 55
## pc -1.5561457273 1.5561457273 56
## Txt -0.3755561087 0.3755561087 56
## ac 2.8108694570 -2.8108694570 57
## happen 1.8019297596 -1.8019297596 57
## real -1.2451931572 1.2451931572 57
## ago 1.1183056337 -1.1183056337 57
## BABE 0.7378975738 -0.7378975738 57
## CARE 0.5291430117 -0.5291430117 57
## END 0.3956857568 -0.3956857568 57
## fault 0.2957462381 -0.2957462381 57
## model 0.2519324228 -0.2519324228 57
## quit 0.1690944459 -0.1690944459 57
## XXXX 0.1585458243 -0.1585458243 57
## that 0.1306502294 -0.1306502294 57
## sexy -0.4392750068 0.4392750068 58
## ; 0.0358650206 -0.0358650206 58
## 4 -0.0198986612 0.0198986612 58
## credit -0.6799734578 0.6799734578 59
## user -0.6227279358 0.6227279358 59
## shortly -0.5452655200 0.5452655200 59
## urgent -0.1424160822 0.1424160822 59
## adult -0.5412390937 0.5412390937 60
## some1 1.2035479211 -1.2035479211 61
## work 0.2273503279 -0.2273503279 61
## filthy -0.9747040004 0.9747040004 62
## Fri -0.8570074734 0.8570074734 62
## hi -0.2273516161 0.2273516161 62
## further -0.9146856144 0.9146856144 63
## 88066 -0.5213212130 0.5213212130 63
## Calls£1 -0.4340847396 0.4340847396 63
## landline -0.4040765880 0.4040765880 63
## HELP -0.3306199544 0.3306199544 63
## 18 -0.2410153124 0.2410153124 63
## wait -0.1639412186 0.1639412186 63
## than -0.8568495084 0.8568495084 64
## an -0.1423287967 0.1423287967 64
## C -0.4697877313 0.4697877313 65
## apply -0.4443067985 0.4443067985 65
## who -0.2682150770 0.2682150770 65
## alone -0.1720596580 0.1720596580 65
## win -0.1661321280 0.1661321280 65
## do 0.0475797233 -0.0475797233 65
## email 0.4853820833 -0.4853820833 66
## FreeMsg -0.1278164029 0.1278164029 66
## smoke 0.1231981109 -0.1231981109 66
## sure 0.0842850270 -0.0842850270 66
## something 0.2976554764 -0.2976554764 67
## eg -0.2903165256 0.2903165256 67
## because 0.2987714342 -0.2987714342 68
## private -0.0948374504 0.0948374504 68
## August -0.7303431714 0.7303431714 69
## amazing -0.4148065712 0.4148065712 69
## blow -0.2354224462 0.2354224462 69
## truly -0.0986818735 0.0986818735 69
## 18p -0.0343212453 0.0343212453 69
## G -0.0156083207 0.0156083207 69
## incredible -0.0056723872 0.0056723872 69
## Txts -0.0029722029 0.0029722029 69
## learn -0.0018284846 0.0018284846 69
## quality -0.8901071966 0.8901071966 70
## darle -0.7950218177 0.7950218177 70
## 2003 -0.4854072660 0.4854072660 70
## unique -0.4702816606 0.4702816606 70
## Ltd -0.3882975934 0.3882975934 70
## yeah 0.1172703272 -0.1172703272 70
## UK -0.0168007956 0.0168007956 70
## web -0.7055803014 0.7055803014 71
## Saturday -0.4135934388 0.4135934388 71
## Account -0.1060924471 0.1060924471 71
## at 0.0804059794 -0.0804059794 71
## Statement -0.0055682486 0.0055682486 71
## hav 0.4049005596 -0.4049005596 72
## ( -0.2294620915 0.2294620915 72
## ask 0.1610387828 -0.1610387828 72
## da 0.0793355313 -0.0793355313 72
## -message -0.3701294005 0.3701294005 73
## Reply -0.2047103851 0.2047103851 73
## +447797706009 -0.1564985707 0.1564985707 73
## chat -0.1200875849 0.1200875849 73
## Bloomberg -0.0510654493 0.0510654493 73
## car 0.0459093134 -0.0459093134 73
## bloomberg.com -0.0131397142 0.0131397142 73
## http://career -0.0057015856 0.0057015856 73
## evening -0.6069564283 0.6069564283 74
## bird -0.7586561993 0.7586561993 75
## still 0.1181729487 -0.1181729487 75
## O2FWD -0.0070181842 0.0070181842 75
## More -0.2711644094 0.2711644094 76
## R -0.2013060309 0.2013060309 76
## Break -0.1498053219 0.1498053219 77
## center -0.0032982197 0.0032982197 77
## widelive.com/index -0.1427821271 0.1427821271 78
## REAL -0.7663308222 0.7663308222 79
## % -0.3023782616 0.3023782616 79
## 5.00 -0.1979743129 0.1979743129 79
## 83222 -0.1309555766 0.1309555766 79
## GET -0.0593884083 0.0593884083 79
## our -0.0530656747 0.0530656747 79
## Bank -0.0517554691 0.0517554691 79
## in 0.0471452491 -0.0471452491 79
## CDGT -0.0106478395 0.0106478395 79
## Strong -0.0077021589 0.0077021589 79
## explosive -0.0027567399 0.0027567399 79
## Granite -0.0027421313 0.0027421313 79
## Symbol -0.0010990120 0.0010990120 79
## Nasdaq -0.0004858133 0.0004858133 79
## Gr8 -0.3533136700 0.3533136700 80
## ring 0.1350919599 -0.1350919599 80
## press -0.1291756232 0.1291756232 80
## 0 -0.0455911763 0.0455911763 80
## result -0.0116246297 0.0116246297 80
## Unsubscribe -0.0103356007 0.0103356007 80
## EURO -0.0067350131 0.0067350131 80
## Arcade -0.0067061587 0.0067061587 80
## inform -0.0026194793 0.0026194793 80
## euro2004 -0.0020511824 0.0020511824 80
## kickoff -0.0008069822 0.0008069822 80
## within -0.1448261898 0.1448261898 81
## boss 0.1292309713 -0.1292309713 81
## Camera -0.0624063695 0.0624063695 81
## Digital -0.0192030565 0.0192030565 81
## SiPix -0.0067517645 0.0067517645 81
## student 0.2160818661 -0.2160818661 82
## top -0.1625630261 0.1625630261 82
## smile 0.0801039352 -0.0801039352 83
## news -0.0584177422 0.0584177422 83
## 62468 -0.1330209665 0.1330209665 84
## FRND -0.0382388180 0.0382388180 84
## able 0.1656260743 -0.1656260743 85
## part -0.1243590703 0.1243590703 85
## 8007 -0.0770602477 0.0770602477 85
## turn 0.0500851918 -0.0500851918 85
## when 0.0229457381 -0.0229457381 85
## purchase -0.2185626139 0.2185626139 86
## log -0.1515582780 0.1515582780 86
## http://www.urawinner.com -0.0142195548 0.0142195548 86
## your -0.1028154054 0.1028154054 87
## shop 0.0670019543 -0.0670019543 87
## Latest -0.0826997442 0.0826997442 89
## dad 0.0819194698 -0.0819194698 89
## IM 0.0810717166 -0.0810717166 89
## follow -0.0650273962 0.0650273962 89
## 2005 -0.0951562895 0.0951562895 90
## POLY -0.0380859820 0.0380859820 90
## send -0.0182862498 0.0182862498 90
## mail 0.0615440574 -0.0615440574 91
## News -0.1294544173 0.1294544173 92
## 542 -0.0926237415 0.0926237415 92
## mrng 0.0603571538 -0.0603571538 92
## :) 0.0088514800 -0.0088514800 92
## hint 0.0222340757 -0.0222340757 93
## file 0.0164144008 -0.0164144008 93
## this -0.0110794815 0.0110794815 93
## then 0.0003830135 -0.0003830135 93
## lady -0.0506076020 0.0506076020 94
## career -0.0258725840 0.0258725840 94
## hw 0.0127901525 -0.0127901525 94
## star -0.0023686148 0.0023686148 94
## find -0.0007062404 0.0007062404 94
## partner -0.0130315931 0.0130315931 95
## cause 0.0141779851 -0.0141779851 96
## voicemail -0.0104162535 0.0104162535 96
## tv -0.0055463087 0.0055463087 96
## chart -0.0052122189 0.0052122189 96
## subscriber -0.0005832920 0.0005832920 96
Look at the first few values in the table above. Do the strongest features that the model found make sense to you as indicators of spam? Answer: I found the strongest features to be “call”, “£”, “to” and “!”. All the signs are positive for spam. I can certainly imagine all of these featuring in a spam text message.
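On the opposite-sign columns in the table: assuming the multinomial model converts per-class scores into probabilities with the usual softmax, a two-class model whose scores are s and -s gives exactly the same spam probability as an ordinary logistic model with score 2s. The sketch below checks this numerically with the intercept and the “call” coefficient from the table. The feature value of 1 is a made-up input, and how the package scales its features is not shown in this notebook, so the resulting probability is illustrative only.

```r
# Coefficients copied from the spam column of the table above.
intercept_spam <- -1.0084374502
call_spam      <-  0.8231671885

# Hypothetical input: "call" has feature value 1 and every other
# feature is 0 (an assumption made for illustration).
score_spam <- intercept_spam + call_spam * 1
score_ham  <- -score_spam  # the ham column is the exact negation

# Softmax over the two class scores.
softmax <- function(s) exp(s) / sum(exp(s))
p_softmax <- softmax(c(ham = score_ham, spam = score_spam))

# Equivalent binary logistic form: the score difference is 2 * score_spam.
p_logistic <- plogis(2 * score_spam)

all.equal(unname(p_softmax["spam"]), p_logistic)
```

Because only the difference between the two class scores matters, the second column carries no extra information; reading either column on its own is enough.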
Reproduce the coefficients, cutting off after a particular lambda number to get only a small number of results (about a dozen is good).
# Question 06
dsst_coef(model$model, lambda_num = 27)
## 13 x 3 sparse Matrix of class "dgCMatrix"
## ham spam MLN
## (Intercept) 0.249848248 -0.249848248 .
## call -0.408994453 0.408994453 2
## £ -0.243674928 0.243674928 5
## to -0.099597910 0.099597910 11
## ! -0.076051815 0.076051815 14
## or -0.084007961 0.084007961 17
## txt -0.110410147 0.110410147 19
## free -0.078472933 0.078472933 19
## / -0.054961656 0.054961656 21
## ... 0.038612052 -0.038612052 22
## -PRON- 0.005298340 -0.005298340 25
## 150p -0.022383909 0.022383909 26
## text -0.003014019 0.003014019 27
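The 13 rows above are consistent with reading the MLN column as the lambda number at which each term first enters the model. That reading is an assumption, since this notebook does not define MLN. Under it, a cutoff at lambda number 27 keeps every term whose MLN is at most 27, plus the intercept. A quick check using the first thirteen MLN values from the full table:

```r
# Terms and MLN values copied from the first rows of the full table.
terms <- c("call", "£", "to", "!", "or", "txt", "free", "/", "...",
           "-PRON-", "150p", "text", "not")
mln   <- c(2, 5, 11, 14, 17, 19, 19, 21, 22, 25, 26, 27, 29)

# Terms that would survive a cutoff at lambda number 27.
kept <- terms[mln <= 27]

# Add one for the intercept row, which has no MLN value.
n_rows <- length(kept) + 1
n_rows
```

The term “not”, with MLN 29, is the first one excluded, which matches the printed output ending at “text”.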
Next, print out the 10 messages that have the highest probability of being spam. Use the function dsst_print_text() to print out the messages and take a moment to read them.
# Question 07
model$docs %>%
  filter(pred_label == "spam") %>%
  arrange(desc(pred_value)) %>%
  slice_head(n = 10) %>%
  dsst_print_text()
## doc00202; spam; valid; spam; 0.999992138321575
## Ur ringtone service has changed! 25 Free credits! Go to
## club4mobiles.com to choose content now! Stop? txt CLUB STOP to 87070.
## 150p/wk Club4 PO Box1146 MK45 2WT
##
## doc00855; spam; train; spam; 0.999982353729283
## Free-message: Jamster!Get the crazy frog sound now! For poly text
## MAD1, for real text MAD2 to 88888. 6 crazy sounds for just 3 GBP/week!
## 16+only! T&C's apply
##
## doc00661; spam; valid; spam; 0.999889171471002
## Had your mobile 10 mths? Update to latest Orange camera/video phones
## for FREE. Save £s with Free texts/weekend calls. Text YES for a
## callback orno to opt out
##
## doc00214; spam; valid; spam; 0.999833415323564
## January Male Sale! Hot Gay chat now cheaper, call 08709222922.
## National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call
## 08712460324 (10p/min)
##
## doc00250; spam; valid; spam; 0.999652384688644
## Freemsg: 1-month unlimited free calls! Activate SmartCall Txt: CALL to
## No: 68866. Subscriptn3gbp/wk unlimited calls Help: 08448714184 Stop?
## txt stop landlineonly
##
## doc00057; spam; train; spam; 0.999578061304966
## I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt
## WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea.
## OTBox 731 LA1 7WS. )
##
## doc00710; spam; train; spam; 0.999548735734572
## 3 FREE TAROT TEXTS! Find out about your love life now! TRY 3 FOR FREE!
## Text CHANCE to 85555 16 only! After 3 Free, Msgs £1.50 each
##
## doc00830; spam; train; spam; 0.99949402694429
## Well done ENGLAND! Get the official poly ringtone or colour flag on
## yer mobile! text TONE or FLAG to 84199 NOW! Opt-out txt ENG STOP.
## Box39822 W111WX £1.50
##
## doc01134; spam; train; spam; 0.99949402694429
## Well done ENGLAND! Get the official poly ringtone or colour flag on
## yer mobile! text TONE or FLAG to 84199 NOW! Opt-out txt ENG STOP.
## Box39822 W111WX £1.50
##
## doc01099; spam; valid; spam; 0.99939969586009
## Great News! Call FREEFONE 08006344447 to claim your guaranteed £1000
## CASH or £2000 gift. Speak to a live operator NOW!
Repeat the last question, but select the ten messages most likely to be ham.
# Question 08
model$docs %>%
  filter(pred_label == "ham") %>%
  arrange(desc(pred_value)) %>%
  slice_head(n = 10) %>%
  dsst_print_text()
## doc00734; ham; train; ham; 0.999995428135907
## The last thing i ever wanted to do was hurt you. And i didn't think it
## would have. You'd laugh, be embarassed, delete the tag and keep going.
## But as far as i knew, it wasn't even up. The fact that you even felt
## like i would do it to hurt you shows you really don't know me at all.
## It was messy wednesday, but it wasn't bad. The problem i have with it
## is you HAVE the time to clean it, but you choose not to. You skype,
## you take pictures, you sleep, you want to go out. I don't mind a few
## things here
##
## doc00226; spam; valid; ham; 0.999615059207897
## SMS. ac JSco: Energy is high, but u may not know where 2channel it.
## 2day ur leadership skills r strong. Psychic? Reply ANS w/question.
## End? Reply END JSCO
##
## doc00025; spam; valid; ham; 0.998646488829747
## SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice
## Hockey. Correct or Incorrect? End? Reply END SPTV
##
## doc00987; ham; train; ham; 0.998102983123443
## Wow. I never realized that you were so embarassed by your
## accomodations. I thought you liked it, since i was doing the best i
## could and you always seemed so happy about "the cave". I'm sorry I
## didn't and don't have more to give. I'm sorry i offered. I'm sorry
## your room was so embarassing.
##
## doc00476; ham; valid; ham; 0.997655412518001
## Are you angry with me. What happen dear
##
## doc01252; ham; valid; ham; 0.997655412518001
## I'm aight. Wat's happening on your side.
##
## doc01035; ham; valid; ham; 0.995760396588148
## Probably money worries. Things are coming due and i have several
## outstanding invoices for work i did two and three months ago.
##
## doc00574; ham; train; ham; 0.992649102839132
## Okay lor... Wah... like that def they wont let us go... Haha... What
## did they say in the terms and conditions?
##
## doc00519; ham; valid; ham; 0.991784016399156
## Okay lor... Will they still let us go a not ah? Coz they will not know
## until later. We drop our cards into the box right?
##
## doc00495; ham; train; ham; 0.991052631029984
## No dude, its not fake..my frnds got money, thts y i'm reffering u..if
## u member wit my mail link, u vl be credited <#> rs and il be
## getiing <#> rs..i can draw my acc wen it is <#> rs..
Next, print out ten random spam messages that were mis-classified.
# Question 09
model$docs %>%
  filter(label == "spam") %>%
  filter(label != pred_label) %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc00703; spam; valid; ham; 0.515106762048036
## Get 3 Lions England tone, reply lionm 4 mono or lionp 4 poly. 4 more
## go 2 www.ringtones.co.uk, the original n best. Tones 3GBP network
## operator rates apply.
##
## doc00336; spam; valid; ham; 0.805331623254029
## A link to your picture has been sent. You can also use http://
## alto18.co.uk/wave/wave.asp?o=44345
##
## doc00801; spam; valid; ham; 0.903187750837926
## Check Out Choose Your Babe Videos @ sms.shsex.netUN fgkslpoPW fgkslpo
##
## doc00475; spam; train; ham; 0.5331847877818
## Your credits have been topped up for http://www.bubbletext.com Your
## renewal Pin is tgxxrz
##
## doc01108; spam; valid; ham; 0.707242898252999
## tddnewsletter@emc1.co.uk (More games from TheDailyDraw) Dear Helen,
## Dozens of Free Games - with great prizesWith..
##
## doc00887; spam; valid; ham; 0.705545996909835
## 3. You have received your mobile content. Enjoy
##
## doc00617; spam; valid; ham; 0.515106762048036
## Get 3 Lions England tone, reply lionm 4 mono or lionp 4 poly. 4 more
## go 2 www.ringtones.co.uk, the original n best. Tones 3GBP network
## operator rates apply
##
## doc01034; spam; train; ham; 0.882557480301154
## ringtoneking 84484
##
## doc00066; spam; valid; ham; 0.821765526679288
## You will recieve your tone within the next 24hrs. For Terms and
## conditions please see Channel U Teletext Pg 750
##
## doc00265; spam; valid; ham; 0.582015539203516
## Want explicit SEX in 30 secs? Ring 02073162414 now! Costs 20p/min
Now, print out ten random ham messages that were mis-classified.
# Question 10
model$docs %>%
  filter(label == "ham") %>%
  filter(label != pred_label) %>%
  slice_sample(n = 10) %>%
  dsst_print_text()
## doc00113; ham; valid; spam; 0.877789205952256
## Oh thats late! Well have a good night and i will give u a call
## tomorrow. Iam now going to go to sleep night night
##
## doc00764; ham; valid; spam; 0.641227009753046
## I‘ll have a look at the frying pan in case it‘s cheap or a book
## perhaps. No that‘s silly a frying pan isn‘t likely to be a book
##
## doc01033; ham; valid; spam; 0.581212523413683
## Nimbomsons. Yep phone knows that one. Obviously, cos thats a real word
##
## doc00733; ham; valid; spam; 0.541082987032844
## Sir Goodmorning, Once free call me.
##
## doc00928; ham; valid; spam; 0.864890363509263
## Hey there! Glad u r better now. I hear u treated urself to a digi cam,
## is it good? We r off at 9pm. Have a fab new year, c u in coupla wks!
##
## doc01127; ham; valid; spam; 0.611784957326342
## Happy new years melody!
##
## doc01256; ham; valid; spam; 0.559170267274495
## My slave! I want you to take 2 or 3 pictures of yourself today in
## bright light on your cell phone! Bright light!
##
## doc01037; ham; valid; spam; 0.614985394011444
## How's it feel? Mr. Your not my real Valentine just my yo Valentine
## even tho u hardly play!!
##
## doc00366; ham; valid; spam; 0.590642494406985
## Life spend with someone for a lifetime may be meaningless but a few
## moments spent with someone who really love you means more than life
## itself..
##
## doc00221; ham; valid; spam; 0.73851105629729
## Hi its Kate how is your evening? I hope i can see you tomorrow for a
## bit but i have to bloody babyjontet! Txt back if u can. :) xxx