Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.

Reading the Data

Today we will again look at a subset of a well-known text analysis corpus call NewsGroups-20.

docs <- read_csv("../data/newsgroups.csv.bz2")
anno <- read_csv("../data/newsgroups_token.csv.bz2")

Questions

Clustering

Use K-means to cluster the categories with K equal to 5 (that’s the default) and 2 PCA dimensions (that’s also the default).

# Question 01
anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5L)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
## # A tibble: 20 × 4
##    label                         v1      v2 cluster
##    <chr>                      <dbl>   <dbl>   <dbl>
##  1 alt.atheism               0.254   0.293        4
##  2 comp.graphics            -0.252   0.188        1
##  3 comp.os.ms-windows.misc  -0.330   0.286        1
##  4 comp.sys.ibm.pc.hardware -0.369   0.265        1
##  5 comp.sys.mac.hardware    -0.321   0.216        1
##  6 comp.windows.x           -0.257   0.248        1
##  7 misc.forsale             -0.258   0.0577       1
##  8 rec.autos                -0.0191 -0.118        2
##  9 rec.motorcycles          -0.0150 -0.132        2
## 10 rec.sport.baseball        0.0372 -0.337        5
## 11 rec.sport.hockey          0.0419 -0.329        5
## 12 sci.crypt                 0.0249  0.0980       2
## 13 sci.electronics          -0.146   0.0430       2
## 14 sci.med                   0.0552 -0.0248       2
## 15 sci.space                -0.0200 -0.0580       2
## 16 soc.religion.christian    0.310   0.359        4
## 17 talk.politics.guns        0.225   0.163        3
## 18 talk.politics.mideast     0.177   0.0805       3
## 19 talk.politics.misc        0.244   0.138        3
## 20 talk.religion.misc        0.353   0.407        4

Now, plot these cluster using color on a plot with the first two principal components. Make sure to force R to treat the color as a categorical variable. Visually verify that the clusters make visual sense.

# Question 02
anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5L) %>%
  ggplot(aes(v1, v2)) +
    geom_point(aes(color = factor(cluster)))

Now, run hierarchical clustering on the categories. Save the model as an R object.

# Question 03
model <- anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_hclust()

Plot the hierarchical clustering and notice what additional information it gives compared to K-means.

# Question 04
plot(model)

Now, plot the colors of the hierarchical clusters on a plot with the first two principal components (this requires a bit more code) using 5 clusters. Take note of any differences in relation to the kmeans clustering.

# Question 05
anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  inner_join(dsst_hclust_cut(model, nclust = 5), by = c("label" = "document")) %>%
  ggplot(aes(v1, v2)) +
    geom_point(aes(color = factor(cluster)))

Repeat the previous question, but this time use only 3 clusters. Take note of how the clusters are grouped together relative to the previous plot.

# Question 06
anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  inner_join(dsst_hclust_cut(model, nclust = 3), by = c("label" = "document")) %>%
  ggplot(aes(v1, v2)) +
    geom_point(aes(color = factor(cluster)))

Exploratory Analysis

Now, let’s do a few more involved tasks with the data using the clusters. In the code below, group the newsgroups into 5 clusters using kmeans, including only nouns and verbs. Save the result as a data frame called clusters.

# Question 07
clusters <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5L)

Next, compute the 5 words most associated (according the G-scores) with the validation set of each cluster. Again, only use nouns and verbs. Join these top terms together using the paste function.

# Question 08
docs_new <- docs %>%
  inner_join(clusters, by = "label")

anno %>%
    filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(docs_new, label_var = "cluster") %>%
  filter(train_id == "valid") %>%
  group_by(label) %>%
  arrange(desc(gscore)) %>%  # actually unneeded since already sorted this way
  slice_head(n = 5) %>%
  summarize(top_terms = paste(token, collapse = " | "))
## # A tibble: 5 × 2
##   label top_terms                                 
##   <dbl> <chr>                                     
## 1     1 game | team | play | suck | player        
## 2     2 drive | have | monitor | board | card     
## 3     3 _ | image | file | color | wire           
## 4     4 say | people | law | church | believe     
## 5     5 mission | key | encryption | request | car

Next, fit an elastic net model to predict the cluster from which each message come from. Again, only use nouns and verbs. Produce a confusion matrix of the validation data.

# Question 09
model <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_enet_build(docs_new, label_var = "cluster")

model$docs %>%
  filter(train_id == "valid") %>%
  select(cluster, pred_label) %>%
  table()
##        pred_label
## cluster   1   2   3   4   5
##       1  36   0   9  53   2
##       2   0  23  40  34   3
##       3   2  11 137  80  20
##       4   0   1  12 268  19
##       5   1   2  23 115 109

Finally, using the same model output, take the docs object returned by the model and create a new column called new_label that is equal to the cluster pasted together with the label. Build something similar to a confusion matrix for the validation data, with the new_label on the rows and the predicted cluster (its called pred_label) on the columns. Does this show that certain newsgroups are to blame for a large share of the confused terms? In what ways?

# Question 10
model$docs %>%
  filter(train_id == "valid") %>%
  mutate(new_label = paste(cluster, label)) %>%
  select(new_label, pred_label) %>%
  table()
##                             pred_label
## new_label                     1  2  3  4  5
##   1 rec.sport.baseball       14  0  8 27  1
##   1 rec.sport.hockey         22  0  1 26  1
##   2 comp.sys.ibm.pc.hardware  0 14 19 15  2
##   2 comp.sys.mac.hardware     0  9 21 19  1
##   3 comp.graphics             0  1 27 19  3
##   3 comp.os.ms-windows.misc   0  4 27 15  4
##   3 comp.windows.x            0  3 32 13  2
##   3 misc.forsale              2  0 33 11  4
##   3 sci.electronics           0  3 18 22  7
##   4 alt.atheism               0  0  1 45  4
##   4 soc.religion.christian    0  0  3 44  3
##   4 talk.politics.guns        0  0  1 45  4
##   4 talk.politics.mideast     0  0  1 47  2
##   4 talk.politics.misc        0  1  4 40  5
##   4 talk.religion.misc        0  0  2 47  1
##   5 rec.autos                 1  0  6 17 26
##   5 rec.motorcycles           0  0  1 13 36
##   5 sci.crypt                 0  1  6 21 22
##   5 sci.med                   0  1  4 36  9
##   5 sci.space                 0  0  6 28 16

Words

Produce a PCA plot of the nouns, using the option invert = TRUE. You will want to adjust the parameter min_df to be 0.01.

# Question 11
anno %>%
  filter(upos %in% c("NOUN")) %>%
  dsst_pca(invert = TRUE, min_df = 0.01) %>%
  ggplot(aes(v1, v2)) +
    geom_text(aes(label = doc_id))

Produce a PCA plot of the verbs, using the option invert = TRUE. You will want to adjust the parameter min_df to be 0.01.

# Question 12
anno %>%
  filter(upos %in% c("VERB")) %>%
  dsst_pca(invert = TRUE, min_df = 0.01) %>%
  ggplot(aes(v1, v2)) +
    geom_text(aes(label = doc_id))

What interesting patterns do you see in the previous two plots?