Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.

Reading the Data

Today we are going to look at a subset of a well-known text analysis corpus called NewsGroups-20. It is an older collection of Usenet newsgroup posts drawn from 20 different categories.

docs <- read_csv("../data/newsgroups.csv.bz2")
anno <- read_csv("../data/newsgroups_token.csv.bz2")

Questions

Supervised Learning

Build an elastic net model to determine the category of the newsgroup messages:

# Question 01
model <- dsst_enet_build(anno, docs)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

Produce a confusion matrix for the messages using just the validation data. Take note of any commonly confused categories:

# Question 02
model$docs %>%
  filter(train_id == "valid") %>%
  select(label, pred_label) %>%
  table()
##                           pred_label
## label                      alt.atheism comp.graphics
##   alt.atheism                       15             0
##   comp.graphics                      0             7
##   comp.os.ms-windows.misc            0             0
##   comp.sys.ibm.pc.hardware           0             2
##   comp.sys.mac.hardware              2             5
##   comp.windows.x                     0             1
##   misc.forsale                       0             2
##   rec.autos                          0             0
##   rec.motorcycles                    0             0
##   rec.sport.baseball                 1             0
##   rec.sport.hockey                   0             0
##   sci.crypt                          0             1
##   sci.electronics                    0             1
##   sci.med                            2             1
##   sci.space                          0             0
##   soc.religion.christian             1             0
##   talk.politics.guns                 1             0
##   talk.politics.mideast              1             0
##   talk.politics.misc                 0             1
##   talk.religion.misc                 7             1
##                           pred_label
## label                      comp.os.ms-windows.misc comp.sys.ibm.pc.hardware
##   alt.atheism                                    0                        0
##   comp.graphics                                  6                        0
##   comp.os.ms-windows.misc                       18                        1
##   comp.sys.ibm.pc.hardware                       5                       12
##   comp.sys.mac.hardware                          0                        2
##   comp.windows.x                                 4                        1
##   misc.forsale                                   2                        2
##   rec.autos                                      0                        0
##   rec.motorcycles                                0                        0
##   rec.sport.baseball                             0                        0
##   rec.sport.hockey                               0                        0
##   sci.crypt                                      0                        2
##   sci.electronics                                1                        2
##   sci.med                                        1                        0
##   sci.space                                      1                        0
##   soc.religion.christian                         0                        0
##   talk.politics.guns                             0                        0
##   talk.politics.mideast                          0                        0
##   talk.politics.misc                             0                        1
##   talk.religion.misc                             0                        0
##                           pred_label
## label                      comp.sys.mac.hardware comp.windows.x
##   alt.atheism                                  0              0
##   comp.graphics                                2              5
##   comp.os.ms-windows.misc                      1              7
##   comp.sys.ibm.pc.hardware                     7              0
##   comp.sys.mac.hardware                       20              1
##   comp.windows.x                               0             25
##   misc.forsale                                 3              1
##   rec.autos                                    0              0
##   rec.motorcycles                              0              0
##   rec.sport.baseball                           0              0
##   rec.sport.hockey                             0              0
##   sci.crypt                                    0              3
##   sci.electronics                              2              1
##   sci.med                                      1              0
##   sci.space                                    0              1
##   soc.religion.christian                       0              0
##   talk.politics.guns                           1              0
##   talk.politics.mideast                        0              0
##   talk.politics.misc                           1              1
##   talk.religion.misc                           0              2
##                           pred_label
## label                      misc.forsale rec.autos rec.motorcycles
##   alt.atheism                         0         2               1
##   comp.graphics                       7         1               0
##   comp.os.ms-windows.misc             0         0               0
##   comp.sys.ibm.pc.hardware            3         0               0
##   comp.sys.mac.hardware               1         0               0
##   comp.windows.x                      1         1               0
##   misc.forsale                       20         1               1
##   rec.autos                           3        23               3
##   rec.motorcycles                     0         1              35
##   rec.sport.baseball                  1         0               0
##   rec.sport.hockey                    1         0               0
##   sci.crypt                           1         0               0
##   sci.electronics                     3         3               0
##   sci.med                             2         0               1
##   sci.space                           2         0               1
##   soc.religion.christian              3         0               1
##   talk.politics.guns                  1         2               1
##   talk.politics.mideast               1         0               0
##   talk.politics.misc                  3         1               2
##   talk.religion.misc                  0         0               1
##                           pred_label
## label                      rec.sport.baseball rec.sport.hockey sci.crypt
##   alt.atheism                               0                0         0
##   comp.graphics                             0                0         0
##   comp.os.ms-windows.misc                   0                1         0
##   comp.sys.ibm.pc.hardware                  0                0         0
##   comp.sys.mac.hardware                     0                0         1
##   comp.windows.x                            0                1         1
##   misc.forsale                              2                0         0
##   rec.autos                                 0                1         0
##   rec.motorcycles                           1                0         0
##   rec.sport.baseball                       18                6         1
##   rec.sport.hockey                          1               32         0
##   sci.crypt                                 0                0        27
##   sci.electronics                           0                1         4
##   sci.med                                   0                0         0
##   sci.space                                 0                0         1
##   soc.religion.christian                    0                1         0
##   talk.politics.guns                        1                1         1
##   talk.politics.mideast                     0                1         0
##   talk.politics.misc                        0                0         0
##   talk.religion.misc                        0                0         1
##                           pred_label
## label                      sci.electronics sci.med sci.space
##   alt.atheism                            0       1         1
##   comp.graphics                          1       0         1
##   comp.os.ms-windows.misc                0       0         1
##   comp.sys.ibm.pc.hardware               1       0         1
##   comp.sys.mac.hardware                  1       0         1
##   comp.windows.x                         0       0         0
##   misc.forsale                           0       0         1
##   rec.autos                              4       0         0
##   rec.motorcycles                        1       1         0
##   rec.sport.baseball                     0       0         0
##   rec.sport.hockey                       0       0         0
##   sci.crypt                              1       0         0
##   sci.electronics                        8       0         1
##   sci.med                                3      18         1
##   sci.space                              2       1        21
##   soc.religion.christian                 0       2         0
##   talk.politics.guns                     0       0         0
##   talk.politics.mideast                  0       1         0
##   talk.politics.misc                     0       0         0
##   talk.religion.misc                     0       0         0
##                           pred_label
## label                      soc.religion.christian talk.politics.guns
##   alt.atheism                                   4                  1
##   comp.graphics                                 0                  0
##   comp.os.ms-windows.misc                       0                  0
##   comp.sys.ibm.pc.hardware                      0                  0
##   comp.sys.mac.hardware                         1                  0
##   comp.windows.x                                0                  0
##   misc.forsale                                  1                  1
##   rec.autos                                     0                  2
##   rec.motorcycles                               0                  0
##   rec.sport.baseball                            0                  2
##   rec.sport.hockey                              0                  0
##   sci.crypt                                     0                  1
##   sci.electronics                               1                  0
##   sci.med                                       0                  0
##   sci.space                                     1                  1
##   soc.religion.christian                       26                  0
##   talk.politics.guns                            1                 10
##   talk.politics.mideast                         2                  1
##   talk.politics.misc                            0                  2
##   talk.religion.misc                            6                  4
##                           pred_label
## label                      talk.politics.mideast talk.politics.misc
##   alt.atheism                                  1                  2
##   comp.graphics                                0                  0
##   comp.os.ms-windows.misc                      0                  0
##   comp.sys.ibm.pc.hardware                     0                  0
##   comp.sys.mac.hardware                        0                  2
##   comp.windows.x                               0                  0
##   misc.forsale                                 0                  1
##   rec.autos                                    0                  0
##   rec.motorcycles                              0                  1
##   rec.sport.baseball                           0                  2
##   rec.sport.hockey                             0                  0
##   sci.crypt                                    0                  1
##   sci.electronics                              0                  0
##   sci.med                                      1                  0
##   sci.space                                    2                  0
##   soc.religion.christian                       3                  2
##   talk.politics.guns                           2                  6
##   talk.politics.mideast                       26                  1
##   talk.politics.misc                           0                 17
##   talk.religion.misc                           1                  2
##                           pred_label
## label                      talk.religion.misc
##   alt.atheism                              22
##   comp.graphics                            20
##   comp.os.ms-windows.misc                  21
##   comp.sys.ibm.pc.hardware                 19
##   comp.sys.mac.hardware                    13
##   comp.windows.x                           15
##   misc.forsale                             12
##   rec.autos                                14
##   rec.motorcycles                          10
##   rec.sport.baseball                       19
##   rec.sport.hockey                         16
##   sci.crypt                                13
##   sci.electronics                          22
##   sci.med                                  19
##   sci.space                                16
##   soc.religion.christian                   11
##   talk.politics.guns                       22
##   talk.politics.mideast                    16
##   talk.politics.misc                       21
##   talk.religion.misc                       25
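A table this wide is hard to scan by eye, so it can help to summarize it. The following is a minimal base R sketch, using a small made-up set of labels rather than the newsgroup data, of how to pull the per-category hit rate and the most frequent off-diagonal (confused) pairs out of a confusion matrix. The vectors `truth` and `pred` are hypothetical stand-ins for the `label` and `pred_label` columns.

```r
# Toy data (invented for illustration; not taken from the newsgroup corpus)
truth <- c("a", "a", "a", "b", "b", "c", "c", "c")
pred  <- c("a", "a", "b", "b", "b", "c", "a", "c")
lv    <- c("a", "b", "c")

# Confusion matrix: rows are true labels, columns are predicted labels
cm <- table(label = factor(truth, lv), pred_label = factor(pred, lv))

# Overall accuracy and the share of each true category predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)
recall   <- diag(cm) / rowSums(cm)

# Zero out the diagonal and list the remaining (confused) pairs, largest first
off <- cm
diag(off) <- 0
confused <- as.data.frame(off, stringsAsFactors = FALSE)
confused <- confused[confused$Freq > 0, ]
confused <- confused[order(-confused$Freq), ]

print(accuracy)
print(recall)
print(confused)
```

Applied to the real output, the same idea would replace `truth` and `pred` with the validation rows of `model$docs`; a column that collects many off-diagonal counts shows up as a category the model predicts too often.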

Look at the coefficients from the model; perhaps use lambda_num = 30. Use the code from Project 2 to look at the positive (and negative, if there are any) terms associated with each category. Do the terms seem to correspond to the categories in an expected way? Note: You can pipe the whole thing into the function View() if you want a better way to look at the output in RStudio.

# Question 03
dsst_coef(model$model, lambda_num = 30, to_tibble = TRUE) %>%
  filter(term != "(Intercept)") %>%
  pivot_longer(names_to = "label", values_to = "coef", cols = -c(term, MLN)) %>%
  filter(coef != 0) %>%
  mutate(direction = if_else(sign(coef) > 0, "positive", "negative")) %>%
  group_by(label, direction) %>%
  summarize(term = paste(term, collapse = " | ")) %>%
  pivot_wider(
    id_cols = "label",
    values_from = "term",
    names_from = "direction",
    values_fill = ""
  ) #%>%
## # A tibble: 20 × 2
## # Groups:   label [20]
##    label                    positive                                        
##    <chr>                    <chr>                                           
##  1 alt.atheism              atheist | jon | Koran | benedikt                
##  2 comp.graphics            graphic | computer                              
##  3 comp.os.ms-windows.misc  Windows | Ultra | 3.1                           
##  4 comp.sys.ibm.pc.hardware IDE | floppy | Lang | clone | bus               
##  5 comp.sys.mac.hardware    Mac | tech | PDS | upgrade | Apple | mac        
##  6 comp.windows.x           window | X11R5 | text                           
##  7 misc.forsale             sale | shipping | interested | condition | obo  
##  8 rec.autos                car                                             
##  9 rec.motorcycles          dod | ride | bike | DoD | Stafford | Winona | B…
## 10 rec.sport.baseball       pitcher | baseball | Baseball | Morris | battin…
## 11 rec.sport.hockey         playoff | hockey | game | Vancouver | play | Pe…
## 12 sci.crypt                encryption | key | Sternlight | Clipper | NSA   
## 13 sci.electronics          circuit | resistor | differential | ohm         
## 14 sci.med                  medical | therapy | medicine | Medicine | sympt…
## 15 sci.space                orbit | space | pat | 91109 | 3684 | 525 | awet…
## 16 soc.religion.christian   Christ | Bible | church | Christians | scriptur…
## 17 talk.politics.guns       firearm | weapon | gun | cult | shooting        
## 18 talk.politics.mideast    Israel | israeli | turkish | occupy | soviet | …
## 19 talk.politics.misc       Clayton | libertarian | situation               
## 20 talk.religion.misc       thrilling | BD
  #View()

Now, use the G-score metrics to find the 4 terms that are most associated with each category. Again, do these seem to match your intuition?

# Question 04
dsst_metrics(anno, docs) %>%
  group_by(label) %>%
  slice_head(n = 4)
## # A tibble: 80 × 8
## # Groups:   label [20]
##    train_id label                   token count expec…¹ count…² gscore  chi2
##    <chr>    <chr>                   <chr> <int>   <dbl>   <int>  <dbl> <dbl>
##  1 train    alt.atheism             "be"    442 169.       7660   322.  459.
##  2 train    alt.atheism             "\""    192  44.9      2032   278.  496.
##  3 train    alt.atheism             ","     416 184.       8329   228.  306.
##  4 train    alt.atheism             "."     353 176.       7968   145.  186.
##  5 train    comp.graphics           ","     672 276.       8329   437.  602.
##  6 train    comp.graphics           "-"     280  64.8      1957   418.  743.
##  7 train    comp.graphics           "ima…    60   2.81       85   308. 1203.
##  8 train    comp.graphics           "dat…    61   4.14      125   247.  809.
##  9 train    comp.os.ms-windows.misc "I"     220  42.8      3183   382.  751.
## 10 train    comp.os.ms-windows.misc "Win…    30   0.564      42   209. 1557.
## # … with 70 more rows, and abbreviated variable names ¹​expected,
## #   ²​count_word
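To make the columns above more concrete, here is a hedged sketch of one common way a G-score (a log-likelihood ratio statistic) is computed from an observed count and an expected count. This is an assumption about the general technique, not a description of what `dsst_metrics()` does internally, and it may not reproduce the numbers in the table exactly. The function name `g_score` and its arguments are hypothetical.

```r
# Two-cell G-score for one token in one category (illustrative sketch).
#   count      : occurrences of the token inside the category
#   count_word : occurrences of the token across the whole corpus
#   n_label    : total number of tokens in the category
#   n_total    : total number of tokens in the corpus
g_score <- function(count, count_word, n_label, n_total) {
  # Expected count if the token were spread evenly across the corpus
  expected <- count_word * n_label / n_total

  # The same quantities for occurrences outside the category
  other          <- count_word - count
  expected_other <- count_word - expected

  # o * log(o / e), treating a zero observed count as contributing zero
  part <- function(o, e) ifelse(o > 0, o * log(o / e), 0)

  2 * (part(count, expected) + part(other, expected_other))
}

# A token used exactly as often as expected scores zero
g_even <- g_score(count = 10, count_word = 20, n_label = 50, n_total = 100)

# A token that appears only inside the category scores much higher
g_only <- g_score(count = 20, count_word = 20, n_label = 50, n_total = 100)

print(c(even = g_even, only = g_only))
```

Under this definition the score grows with both the size of the gap between observed and expected counts and the raw frequency of the token, which is one reason very common tokens such as punctuation can rank near the top.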

Unsupervised Learning

Let’s move on to new material. Compute the first two principal components of the categories. Remember to set the document variable to “label”.

# Question 05
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_pca(doc_var = "label")
## # A tibble: 20 × 3
##    label                         v1      v2
##    <chr>                      <dbl>   <dbl>
##  1 alt.atheism               0.254   0.293 
##  2 comp.graphics            -0.252   0.188 
##  3 comp.os.ms-windows.misc  -0.330   0.286 
##  4 comp.sys.ibm.pc.hardware -0.369   0.265 
##  5 comp.sys.mac.hardware    -0.321   0.216 
##  6 comp.windows.x           -0.257   0.248 
##  7 misc.forsale             -0.258   0.0577
##  8 rec.autos                -0.0191 -0.118 
##  9 rec.motorcycles          -0.0150 -0.132 
## 10 rec.sport.baseball        0.0372 -0.337 
## 11 rec.sport.hockey          0.0419 -0.329 
## 12 sci.crypt                 0.0249  0.0980
## 13 sci.electronics          -0.146   0.0430
## 14 sci.med                   0.0552 -0.0248
## 15 sci.space                -0.0200 -0.0580
## 16 soc.religion.christian    0.310   0.359 
## 17 talk.politics.guns        0.225   0.163 
## 18 talk.politics.mideast     0.177   0.0805
## 19 talk.politics.misc        0.244   0.138 
## 20 talk.religion.misc        0.353   0.407
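For intuition about what the `v1` and `v2` columns represent, the sketch below runs a principal component analysis on a tiny made-up term-frequency matrix with base R's `prcomp()`. It is only an illustration of the general idea; the preprocessing inside `dsst_pca()` (for example, how terms are weighted) is not shown in this notebook and is likely to differ. The row and column names are invented.

```r
# Toy term counts: one row per category, one column per term (invented data)
counts <- matrix(
  c(5, 0, 1, 0,
    4, 1, 0, 0,
    0, 6, 0, 2,
    0, 5, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("cat_a", "cat_b", "cat_c", "cat_d"),
    c("mac", "gun", "sale", "law")
  )
)

# Convert counts to within-row proportions so long categories do not dominate
freq <- counts / rowSums(counts)

# Center the columns and project the rows onto the principal components
pca <- prcomp(freq, center = TRUE, scale. = FALSE)

# Keep the first two components, mirroring the label / v1 / v2 layout above
scores <- data.frame(
  label = rownames(freq),
  v1 = pca$x[, 1],
  v2 = pca$x[, 2]
)

print(scores)
```

Rows with similar term profiles (here `cat_a` with `cat_b`, and `cat_c` with `cat_d`) end up close together on the first component. The sign of each component is arbitrary, so only relative positions are meaningful.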

Plot (in R) the first two principal components of the categories. Add labels using a text repel layer. Try to find some of the commonly confused category pairs from Question 02 in the PCA plot.

# Question 06
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_pca(doc_var = "label") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))

Now, produce a corresponding UMAP plot. Is this easier or more difficult to interpret?

# Question 07
anno %>%
  inner_join(docs, by = "doc_id") %>%
  dsst_umap(doc_var = "label") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))

Next, produce the principal components for the messages themselves. Save the results as a JSON file and go to the link below to visualize the results. Color the points based on the labels.

# Question 08
dsst_pca(anno) %>% dsst_json_drep(docs, color_var = "label")

Repeat the last question using UMAP in place of PCA. Did you find any interesting clusters of documents?

# Question 09
dsst_umap(anno) %>% dsst_json_drep(docs, color_var = "label")

Make sure not to rush through this step; take a couple of minutes to pan around in the embedding space.