Load the Data

For today’s notes we will work with one of the data sets from Project 2, specifically the reviews of Music CDs.

docs <- read_csv("../data/amazon_cds.csv.bz2")
anno <- read_csv("../data/amazon_cds_token.csv.bz2")

Today, we will continue our study of unsupervised learning by looking at a few extensions of what we saw in the previous notes.

Clusters with K-means

One thing that we often do when looking at dimensionality reduction plots is to look for clumps of documents that sit near one another. We can do this explicitly by clustering the data with a clustering algorithm. Here, we will use a popular option called K-means (note that this is not the same as k-NN, though I suppose it is not entirely unrelated). It is an iterative algorithm that works as follows:

  1. Pick the number of clusters N that you want to detect.
  2. Randomly choose N data points to serve as the starting cluster centroids.
  3. Compute the distance of every data point to the centroids of the current clusters.
  4. Assign each data point to the cluster whose centroid it is closest to.
  5. Re-compute the cluster centroids as the average value of all the points in a cluster.
  6. Take the new cluster centroids, and repeat the process (compute distances, reassign to groups, and recompute the centroids) iteratively until convergence.

The algorithm is not entirely deterministic because of the random starting points. Typically, the algorithm is run several times and the “best” clustering is chosen. How do we define the “best” clustering? A typical method is to measure the total sum of squared distances from each point to its cluster centroid, a quantity that a good clustering will minimize.
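
To make the multiple-restart idea concrete, here is a small sketch using base R’s kmeans() function on a synthetic matrix (this is an illustration only, not the dsst_kmeans helper used below). The nstart argument re-runs the algorithm from several random starting points and keeps the run with the smallest total within-cluster sum of squares.

# illustration only: base R k-means on a small synthetic data set
set.seed(1)
X <- matrix(rnorm(50 * 2), ncol = 2)        # 50 points in 2 dimensions
fit <- kmeans(X, centers = 3, nstart = 25)  # 3 clusters, 25 random restarts
fit$cluster                                 # cluster assignment for each point
fit$tot.withinss                            # total within-cluster sum of squares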

Usually, K-means is run on a set of PCA coordinates rather than the entire embedding itself. Often you will find that including a few dozen PCA components provides a better fit (though it is rarely useful to use the entire TF-IDF).
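
As a rough sketch of what this looks like outside of the helper functions, and assuming a dense document-term matrix named tfidf (a hypothetical object, not one created above), we could reduce the data to a handful of principal components and then cluster those coordinates:

# sketch only: tfidf is a hypothetical dense document-term matrix
pca <- prcomp(tfidf)                        # principal components of the matrix
scores <- pca$x[, 1:10]                     # keep the first 10 components
fit <- kmeans(scores, centers = 5, nstart = 25)
table(fit$cluster)                          # number of documents in each cluster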

To apply K-means, we will use the dsst_kmeans function. We need to set the number of clusters; it is also possible to change the number of PCA dimensions.

anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5L)
## # A tibble: 25 × 4
##    label      v1       v2 cluster
##    <chr>   <dbl>    <dbl>   <dbl>
##  1 U01   -0.0446 -0.0618        4
##  2 U02   -0.259   0.00322       1
##  3 U03   -0.347   0.0331        1
##  4 U04    0.0204 -0.144         5
##  5 U05    0.0113 -0.327         3
##  6 U06   -0.425  -0.00215       2
##  7 U07    0.0163 -0.171         5
##  8 U08   -0.436  -0.0210        2
##  9 U09   -0.0402 -0.196         5
## 10 U10   -0.0958 -0.0912        4
## # … with 15 more rows

Notice that these correspond closely to the clusters you would see by grouping nearby points in the PCA plot from last time.

The cluster assignments also allow us to build other visualizations:

anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5) %>%
  ggplot(aes(v1, v2, color = factor(cluster))) +
    geom_point() +
    geom_text_repel(aes(label = label), show.legend = FALSE) +
    theme_void()

Hierarchical Clustering

Another method for clustering analysis is to use a hierarchical approach. We start by computing the nearest neighbor to each data point. Then, we combine the two points that are closest together into a single cluster. Next, we recompute the nearest neighbors using the center of the two combined points in place of the individual points. This continues until all of the data are in a single cluster.
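
The dsst_hclust function used below takes care of the details for us, but for intuition here is a minimal base R sketch of the same idea on a synthetic matrix, using dist() for the pairwise distances and hclust() for the agglomerative merging (average linkage is one of several linkage options).

# illustration only: base R hierarchical clustering on synthetic data
set.seed(1)
X <- matrix(rnorm(20 * 2), ncol = 2)        # 20 points in 2 dimensions
hc <- hclust(dist(X), method = "average")   # merge the closest clusters step by step
plot(hc)                                    # dendrogram showing the merge order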

To run hierarchical clustering, we can use the dsst_hclust function:

model <- anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_hclust()

The hierarchy of clusters can be seen by plotting the model:

plot(model)

We can create a concrete set of clusters by picking the number of clusters or the height of the tree to cut at:

dsst_hclust_cut(model, nclust = 10)
## # A tibble: 25 × 2
##    document cluster
##    <chr>      <dbl>
##  1 U01            1
##  2 U02            2
##  3 U03            3
##  4 U23            3
##  5 U04            4
##  6 U07            4
##  7 U11            4
##  8 U21            4
##  9 U22            4
## 10 U25            4
## # … with 15 more rows
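
For reference, the base R analogue of this step is cutree(), which takes an hclust object (such as hc from the earlier sketch) and accepts either a desired number of clusters or a height at which to cut the tree.

# illustration only: cutting a base R hclust object into flat clusters
cutree(hc, k = 10)                          # exactly 10 clusters
cutree(hc, h = 1.5)                         # cut the tree at height 1.5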

Hierarchical clustering works well for a small number of documents but can become difficult to use when working with a larger collection.

Word Relationships

In the preceding analyses, we have focused on studying the documents through their usage of words. It turns out that it is also possible to apply dimensionality reduction and distance metrics with the roles of the words and the documents swapped. All of the concepts work the same, but now each word is described by how often it is used in every document.
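
In base R terms, and again assuming the hypothetical tfidf matrix from the earlier sketch, the swap amounts to nothing more than transposing the matrix before reducing it, so that the rows being embedded are words rather than documents.

# sketch only: transpose so that rows are words, then reduce as before
word_pca <- prcomp(t(tfidf))
head(word_pca$x[, 1:2])                     # first two components for each word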

I included an easy-to-use argument to the dsst_pca function to compute the PCA fit of this transposed analysis (as well as UMAP):

dsst_pca(anno, invert = TRUE)
## # A tibble: 100 × 3
##    doc_id    v1       v2
##    <chr>  <dbl>    <dbl>
##  1 to     0.168 -0.0313 
##  2 this   0.138  0.119  
##  3 in     0.145 -0.195  
##  4 it     0.156  0.0813 
##  5 that   0.159  0.00803
##  6 not    0.152  0.0571 
##  7 have   0.142 -0.0369 
##  8 with   0.147 -0.0566 
##  9 on     0.127 -0.0805 
## 10 I      0.153  0.157  
## # … with 90 more rows

We can also compute clusters of words:

anno %>%
  filter(upos %in% c("NOUN")) %>%
  dsst_pca(n_dims = 15L, invert = TRUE) %>%
  dsst_kmeans(n_clusters = 5L) %>%
  group_by(cluster) %>%
  summarize(words = paste(doc_id, collapse = "; "))
## # A tibble: 5 × 2
##   cluster words                                                             
##     <dbl> <chr>                                                             
## 1       1 time; way; thing; day; love; line; world                          
## 2       2 song; music; work; version; hit; tune; style; pop; number; lot; c…
## 3       3 album; track; year; release; title; side                          
## 4       4 sound; recording; performance; one; voice; note; quality; singer;…
## 5       5 cd; band; fan; rock; guitar; bit; vocal; disc; set; melody

Do you see any other interesting patterns here?

Hopefully you are starting to see the range of things that we can do with this kind of method. For example, let’s cluster the nouns, verbs, and adjectives:

df <- anno %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ")) %>%
  mutate(lemma = stri_paste(lemma, upos, sep = "_")) %>%
  dsst_pca(n_dims = 15L, invert = TRUE) %>%
  dsst_kmeans(n_clusters = 5L)
df
## # A tibble: 100 × 17
##    doc_id    v1       v2       v3       v4       v5      v6       v7      v8
##    <chr>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
##  1 have_… 0.180 -0.00977  0.0147  -0.0176   0.0234   0.0949  4.46e-2 -0.0163
##  2 good_… 0.133 -0.0604   0.0396  -0.0469  -0.116    0.0128 -4.27e-2 -0.177 
##  3 album… 0.137 -0.195   -0.111    0.131    0.167    0.0411 -6.26e-2 -0.0997
##  4 song_… 0.164 -0.171   -0.0697   0.0335  -0.101    0.0772 -1.84e-1  0.131 
##  5 music… 0.107  0.152    0.0497  -0.0278   0.212   -0.158  -1.54e-1  0.137 
##  6 make_… 0.164  0.00980 -0.00268 -0.0529   0.0387   0.0329  3.73e-2  0.0589
##  7 be_VE… 0.134  0.125    0.0455   0.00135 -0.0419   0.0255  7.79e-2  0.0344
##  8 do_VE… 0.144 -0.0754   0.0707  -0.119   -0.0215  -0.0574  1.08e-1  0.0344
##  9 great… 0.117 -0.0479  -0.00495 -0.248    0.00778  0.0102  5.49e-4 -0.0778
## 10 time_… 0.150  0.00945 -0.00249  0.104    0.0473  -0.0395  1.20e-1  0.0521
## # … with 90 more rows, and 8 more variables: v9 <dbl>, v10 <dbl>,
## #   v11 <dbl>, v12 <dbl>, v13 <dbl>, v14 <dbl>, v15 <dbl>, cluster <dbl>

And then plot them according to the part of speech:

df %>%
  mutate(upos = stri_extract(doc_id, regex = "[^_][A-Z]+\\Z")) %>%
  mutate(cluster = factor(cluster)) %>%
  ggplot(aes(v1, v2, color = upos)) +
    geom_text(aes(label = doc_id), show.legend = FALSE) +
    theme_void()