Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.
Today we are going to look at a subset of a well-known text analysis corpus called NewsGroups-20. It’s an old set of mailing list archives from 20 different categories.
docs <- read_csv("../data/newsgroups.csv.bz2")
anno <- read_csv("../data/newsgroups_token.csv.bz2")
Build an elastic net model to determine the category of the newsgroup messages:
# Question 01
model <- dsst_enet_build(anno, docs)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
Produce a confusion matrix for the messages using just the validation data. Take note of any commonly confused categories:
# Question 02
model$docs %>%
  filter(train_id == "valid") %>%
  select(label, pred_label) %>%
  table()
## pred_label
## label alt.atheism comp.graphics
## alt.atheism 15 0
## comp.graphics 0 7
## comp.os.ms-windows.misc 0 0
## comp.sys.ibm.pc.hardware 0 2
## comp.sys.mac.hardware 2 5
## comp.windows.x 0 1
## misc.forsale 0 2
## rec.autos 0 0
## rec.motorcycles 0 0
## rec.sport.baseball 1 0
## rec.sport.hockey 0 0
## sci.crypt 0 1
## sci.electronics 0 1
## sci.med 2 1
## sci.space 0 0
## soc.religion.christian 1 0
## talk.politics.guns 1 0
## talk.politics.mideast 1 0
## talk.politics.misc 0 1
## talk.religion.misc 7 1
## pred_label
## label comp.os.ms-windows.misc comp.sys.ibm.pc.hardware
## alt.atheism 0 0
## comp.graphics 6 0
## comp.os.ms-windows.misc 18 1
## comp.sys.ibm.pc.hardware 5 12
## comp.sys.mac.hardware 0 2
## comp.windows.x 4 1
## misc.forsale 2 2
## rec.autos 0 0
## rec.motorcycles 0 0
## rec.sport.baseball 0 0
## rec.sport.hockey 0 0
## sci.crypt 0 2
## sci.electronics 1 2
## sci.med 1 0
## sci.space 1 0
## soc.religion.christian 0 0
## talk.politics.guns 0 0
## talk.politics.mideast 0 0
## talk.politics.misc 0 1
## talk.religion.misc 0 0
## pred_label
## label comp.sys.mac.hardware comp.windows.x
## alt.atheism 0 0
## comp.graphics 2 5
## comp.os.ms-windows.misc 1 7
## comp.sys.ibm.pc.hardware 7 0
## comp.sys.mac.hardware 20 1
## comp.windows.x 0 25
## misc.forsale 3 1
## rec.autos 0 0
## rec.motorcycles 0 0
## rec.sport.baseball 0 0
## rec.sport.hockey 0 0
## sci.crypt 0 3
## sci.electronics 2 1
## sci.med 1 0
## sci.space 0 1
## soc.religion.christian 0 0
## talk.politics.guns 1 0
## talk.politics.mideast 0 0
## talk.politics.misc 1 1
## talk.religion.misc 0 2
## pred_label
## label misc.forsale rec.autos rec.motorcycles
## alt.atheism 0 2 1
## comp.graphics 7 1 0
## comp.os.ms-windows.misc 0 0 0
## comp.sys.ibm.pc.hardware 3 0 0
## comp.sys.mac.hardware 1 0 0
## comp.windows.x 1 1 0
## misc.forsale 20 1 1
## rec.autos 3 23 3
## rec.motorcycles 0 1 35
## rec.sport.baseball 1 0 0
## rec.sport.hockey 1 0 0
## sci.crypt 1 0 0
## sci.electronics 3 3 0
## sci.med 2 0 1
## sci.space 2 0 1
## soc.religion.christian 3 0 1
## talk.politics.guns 1 2 1
## talk.politics.mideast 1 0 0
## talk.politics.misc 3 1 2
## talk.religion.misc 0 0 1
## pred_label
## label rec.sport.baseball rec.sport.hockey sci.crypt
## alt.atheism 0 0 0
## comp.graphics 0 0 0
## comp.os.ms-windows.misc 0 1 0
## comp.sys.ibm.pc.hardware 0 0 0
## comp.sys.mac.hardware 0 0 1
## comp.windows.x 0 1 1
## misc.forsale 2 0 0
## rec.autos 0 1 0
## rec.motorcycles 1 0 0
## rec.sport.baseball 18 6 1
## rec.sport.hockey 1 32 0
## sci.crypt 0 0 27
## sci.electronics 0 1 4
## sci.med 0 0 0
## sci.space 0 0 1
## soc.religion.christian 0 1 0
## talk.politics.guns 1 1 1
## talk.politics.mideast 0 1 0
## talk.politics.misc 0 0 0
## talk.religion.misc 0 0 1
## pred_label
## label sci.electronics sci.med sci.space
## alt.atheism 0 1 1
## comp.graphics 1 0 1
## comp.os.ms-windows.misc 0 0 1
## comp.sys.ibm.pc.hardware 1 0 1
## comp.sys.mac.hardware 1 0 1
## comp.windows.x 0 0 0
## misc.forsale 0 0 1
## rec.autos 4 0 0
## rec.motorcycles 1 1 0
## rec.sport.baseball 0 0 0
## rec.sport.hockey 0 0 0
## sci.crypt 1 0 0
## sci.electronics 8 0 1
## sci.med 3 18 1
## sci.space 2 1 21
## soc.religion.christian 0 2 0
## talk.politics.guns 0 0 0
## talk.politics.mideast 0 1 0
## talk.politics.misc 0 0 0
## talk.religion.misc 0 0 0
## pred_label
## label soc.religion.christian talk.politics.guns
## alt.atheism 4 1
## comp.graphics 0 0
## comp.os.ms-windows.misc 0 0
## comp.sys.ibm.pc.hardware 0 0
## comp.sys.mac.hardware 1 0
## comp.windows.x 0 0
## misc.forsale 1 1
## rec.autos 0 2
## rec.motorcycles 0 0
## rec.sport.baseball 0 2
## rec.sport.hockey 0 0
## sci.crypt 0 1
## sci.electronics 1 0
## sci.med 0 0
## sci.space 1 1
## soc.religion.christian 26 0
## talk.politics.guns 1 10
## talk.politics.mideast 2 1
## talk.politics.misc 0 2
## talk.religion.misc 6 4
## pred_label
## label talk.politics.mideast talk.politics.misc
## alt.atheism 1 2
## comp.graphics 0 0
## comp.os.ms-windows.misc 0 0
## comp.sys.ibm.pc.hardware 0 0
## comp.sys.mac.hardware 0 2
## comp.windows.x 0 0
## misc.forsale 0 1
## rec.autos 0 0
## rec.motorcycles 0 1
## rec.sport.baseball 0 2
## rec.sport.hockey 0 0
## sci.crypt 0 1
## sci.electronics 0 0
## sci.med 1 0
## sci.space 2 0
## soc.religion.christian 3 2
## talk.politics.guns 2 6
## talk.politics.mideast 26 1
## talk.politics.misc 0 17
## talk.religion.misc 1 2
## pred_label
## label talk.religion.misc
## alt.atheism 22
## comp.graphics 20
## comp.os.ms-windows.misc 21
## comp.sys.ibm.pc.hardware 19
## comp.sys.mac.hardware 13
## comp.windows.x 15
## misc.forsale 12
## rec.autos 14
## rec.motorcycles 10
## rec.sport.baseball 19
## rec.sport.hockey 16
## sci.crypt 13
## sci.electronics 22
## sci.med 19
## sci.space 16
## soc.religion.christian 11
## talk.politics.guns 22
## talk.politics.mideast 16
## talk.politics.misc 21
## talk.religion.misc 25
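With 20 categories the printed table is hard to scan by eye. The sketch below shows one way to pull the overall accuracy and the most frequent confusions out of a confusion matrix. It uses base R only and a small made-up data frame named toy (an assumption for illustration, not the course data); the same steps should apply to the label and pred_label columns selected above.

```r
# Made-up true and predicted labels for six messages (illustration only).
toy <- data.frame(
  label      = c("rec.autos", "rec.autos", "sci.med", "sci.med", "sci.med", "rec.autos"),
  pred_label = c("rec.autos", "sci.med", "sci.med", "sci.med", "rec.autos", "rec.autos")
)

# Use one shared set of levels so the table is square and the diagonal lines up.
lv <- sort(unique(c(toy$label, toy$pred_label)))
conf <- table(
  label      = factor(toy$label, levels = lv),
  pred_label = factor(toy$pred_label, levels = lv)
)

# Overall accuracy: share of messages that fall on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)

# Off-diagonal cells, sorted by count, list the most common confusions first.
errors <- as.data.frame(conf, stringsAsFactors = FALSE)
errors <- errors[errors$label != errors$pred_label, ]
errors <- errors[order(-errors$Freq), ]

print(accuracy)
print(errors)
```

Sorting the off-diagonal cells this way makes it easier to spot pairs of categories that the model mixes up in both directions.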
Look at the coefficients from the model; perhaps use lambda_num = 30. Use the code from Project 2 to look at the positive (and negative, if there are any) terms associated with each category. Do the terms seem to correspond to the categories in an expected way? Note: You can pipe the whole thing into the function View() if you want a better way to look at the output in RStudio.
# Question 03
dsst_coef(model$model, lambda_num = 30, to_tibble = TRUE) %>%
  filter(term != "(Intercept)") %>%
  pivot_longer(names_to = "label", values_to = "coef", cols = -c(term, MLN)) %>%
  filter(coef != 0) %>%
  mutate(direction = if_else(sign(coef) > 0, "positive", "negative")) %>%
  group_by(label, direction) %>%
  summarize(term = paste(term, collapse = " | ")) %>%
  pivot_wider(
    id_cols = "label",
    values_from = "term",
    names_from = "direction",
    values_fill = ""
  ) #%>%
## # A tibble: 20 × 2
## # Groups: label [20]
## label positive
## <chr> <chr>
## 1 alt.atheism atheist | jon | Koran | benedikt
## 2 comp.graphics graphic | computer
## 3 comp.os.ms-windows.misc Windows | Ultra | 3.1
## 4 comp.sys.ibm.pc.hardware IDE | floppy | Lang | clone | bus
## 5 comp.sys.mac.hardware Mac | tech | PDS | upgrade | Apple | mac
## 6 comp.windows.x window | X11R5 | text
## 7 misc.forsale sale | shipping | interested | condition | obo
## 8 rec.autos car
## 9 rec.motorcycles dod | ride | bike | DoD | Stafford | Winona | B…
## 10 rec.sport.baseball pitcher | baseball | Baseball | Morris | battin…
## 11 rec.sport.hockey playoff | hockey | game | Vancouver | play | Pe…
## 12 sci.crypt encryption | key | Sternlight | Clipper | NSA
## 13 sci.electronics circuit | resistor | differential | ohm
## 14 sci.med medical | therapy | medicine | Medicine | sympt…
## 15 sci.space orbit | space | pat | 91109 | 3684 | 525 | awet…
## 16 soc.religion.christian Christ | Bible | church | Christians | scriptur…
## 17 talk.politics.guns firearm | weapon | gun | cult | shooting
## 18 talk.politics.mideast Israel | israeli | turkish | occupy | soviet | …
## 19 talk.politics.misc Clayton | libertarian | situation
## 20 talk.religion.misc thrilling | BD
#View()
Now, use the G-score metrics to find the 4 terms that are most associated with each category. Again, do these seem to match your intuition?
# Question 04
dsst_metrics(anno, docs) %>%
group_by(label) %>%
slice_head(n = 4)
## # A tibble: 80 × 8
## # Groups: label [20]
## train_id label token count expec…¹ count…² gscore chi2
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 train alt.atheism "be" 442 169. 7660 322. 459.
## 2 train alt.atheism "\"" 192 44.9 2032 278. 496.
## 3 train alt.atheism "," 416 184. 8329 228. 306.
## 4 train alt.atheism "." 353 176. 7968 145. 186.
## 5 train comp.graphics "," 672 276. 8329 437. 602.
## 6 train comp.graphics "-" 280 64.8 1957 418. 743.
## 7 train comp.graphics "ima… 60 2.81 85 308. 1203.
## 8 train comp.graphics "dat… 61 4.14 125 247. 809.
## 9 train comp.os.ms-windows.misc "I" 220 42.8 3183 382. 751.
## 10 train comp.os.ms-windows.misc "Win… 30 0.564 42 209. 1557.
## # … with 70 more rows, and abbreviated variable names ¹expected,
## # ²count_word
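As a reminder of what a G-score measures, here is a minimal sketch of the log-likelihood ratio statistic for a single token and category, computed from a 2x2 table of observed versus expected counts. The function g_score and its argument names are hypothetical, and the numbers in the usage lines are made up; this is not necessarily the exact formula used inside dsst_metrics.

```r
# G = 2 * sum(O * log(O / E)) over the four cells of a 2x2 table.
#   count_in    : occurrences of the token inside the category
#   count_total : occurrences of the token in the whole corpus
#   n_in        : total number of tokens inside the category
#   n_total     : total number of tokens in the whole corpus
g_score <- function(count_in, count_total, n_in, n_total) {
  # Rows: this token / all other tokens. Columns: in category / elsewhere.
  observed <- matrix(
    c(count_in,        count_total - count_in,
      n_in - count_in, (n_total - n_in) - (count_total - count_in)),
    nrow = 2, byrow = TRUE
  )
  # Expected counts if the token were spread evenly across categories.
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Empty cells contribute zero to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# A token used exactly as often as expected scores (numerically) zero.
even <- g_score(count_in = 10, count_total = 100, n_in = 100, n_total = 1000)

# A token concentrated in one category scores much higher.
skewed <- g_score(count_in = 60, count_total = 85, n_in = 1000, n_total = 30000)

print(c(even = even, skewed = skewed))
```

Frequent tokens can earn large scores from modest over-representation, which is one possible reason punctuation and common words can appear near the top of a list like the one above.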
Let’s move on to new material. Compute the first two principal components of the categories. Remember to set the document variable to “label”.
# Question 05
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_pca(doc_var = "label")
## # A tibble: 20 × 3
## label v1 v2
## <chr> <dbl> <dbl>
## 1 alt.atheism 0.254 0.293
## 2 comp.graphics -0.252 0.188
## 3 comp.os.ms-windows.misc -0.330 0.286
## 4 comp.sys.ibm.pc.hardware -0.369 0.265
## 5 comp.sys.mac.hardware -0.321 0.216
## 6 comp.windows.x -0.257 0.248
## 7 misc.forsale -0.258 0.0577
## 8 rec.autos -0.0191 -0.118
## 9 rec.motorcycles -0.0150 -0.132
## 10 rec.sport.baseball 0.0372 -0.337
## 11 rec.sport.hockey 0.0419 -0.329
## 12 sci.crypt 0.0249 0.0980
## 13 sci.electronics -0.146 0.0430
## 14 sci.med 0.0552 -0.0248
## 15 sci.space -0.0200 -0.0580
## 16 soc.religion.christian 0.310 0.359
## 17 talk.politics.guns 0.225 0.163
## 18 talk.politics.mideast 0.177 0.0805
## 19 talk.politics.misc 0.244 0.138
## 20 talk.religion.misc 0.353 0.407
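To see roughly where columns like v1 and v2 come from, the sketch below runs a principal component analysis on a tiny made-up table of term counts using prcomp from base R. The counts, the four terms, and the choice to convert counts to row proportions are all assumptions for illustration; dsst_pca may weight and scale the terms differently.

```r
# Made-up term counts: one row per category, one column per term.
counts <- matrix(
  c(9, 1, 0, 0,
    8, 2, 1, 0,
    0, 1, 9, 7,
    1, 0, 8, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("comp.graphics", "comp.windows.x", "rec.sport.hockey", "rec.sport.baseball"),
    c("window", "graphic", "game", "team")
  )
)

# Convert counts to within-row proportions so larger categories do not dominate.
props <- counts / rowSums(counts)

# Centre the columns and project each category onto the principal components.
pca <- prcomp(props, center = TRUE, scale. = FALSE)

# Keep the first two components, named to match the output above.
scores <- data.frame(
  label = rownames(props),
  v1 = pca$x[, 1],
  v2 = pca$x[, 2],
  row.names = NULL
)
print(scores)
```

In this toy example the two computing categories land on one side of v1 and the two sports categories on the other. The sign of each component is arbitrary, so only relative positions are meaningful.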
Plot (in R) the first two principal components of the categories. Add labels using a text repel layer. Try to find some of the document pairs in the PCA plot.
# Question 06
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_pca(doc_var = "label") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))
Now, produce a corresponding UMAP plot. Is this easier or more difficult to interpret?
# Question 07
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_umap(doc_var = "label") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))
Next, produce the principal components for the messages themselves. Save the results as a JSON file and go to the link below to visualize the results. Color the points based on the labels.
# Question 08
dsst_pca(anno) %>% dsst_json_drep(docs, color_var = "label")
Repeat the last question for the UMAP parameters. Did you find any interesting clusters of documents?
# Question 09
dsst_umap(anno) %>% dsst_json_drep(docs, color_var = "label")
Make sure to not rush through this step; take a couple minutes to pan around in the embedding space.