For today’s notes we will work with one of the data sets from Project 2, specifically the reviews of Music CDs.
docs <- read_csv("../data/amazon_cds.csv.bz2")
anno <- read_csv("../data/amazon_cds_token.csv.bz2")
Today, we will continue our study of unsupervised learning by looking at a few extensions of what we saw in the previous notes.
One thing that we often do when looking at plots of dimensionality reduction is to look for clumps of documents that co-occur. We can do this explicitly by clustering the data using a clustering algorithm. Here, we will use a popular option called K-means (note, this is not the same as k-NN, but I suppose it is not entirely unrelated). It is an iterative algorithm that works as follows:
1. Pick K points at random to serve as the starting cluster centers.
2. Assign each data point to the cluster with the closest center.
3. Recompute each cluster center as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
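To make the steps concrete, here is a minimal from-scratch sketch of the iteration (illustration only; it assumes a numeric matrix x with one row per document and ignores edge cases such as empty clusters):
kmeans_sketch <- function(x, k, n_iter = 20L) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]       # step 1: random starting points
  for (i in seq_len(n_iter)) {
    d <- as.matrix(dist(rbind(centers, x)))[seq_len(k), -seq_len(k)]
    cluster <- apply(d, 2, which.min)                    # step 2: assign to nearest center
    for (j in seq_len(k)) {                              # step 3: recompute the centers
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  cluster
}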
The algorithm is not entirely deterministic because of the random starting points. Typically, the algorithm is run several times and the “best” clustering is chosen. How do we define the “best” in the case of a clustering algorithm? A typical method is to measure the sum of squared distances to the cluster centroids, a quantity that a good clustering will minimize.
Usually, K-means is run on a set of PCA coordinates rather than the entire embedding itself. Often you will find that including a few dozen PCA components provides a better fit (though it is rarely useful to use the entire TF-IDF).
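As a point of comparison, base R’s kmeans() expresses both of these ideas directly. A minimal sketch, assuming tfidf is a dense documents-by-terms matrix (a hypothetical name); the nstart argument runs the algorithm from several random starting points and keeps the fit with the smallest total within-cluster sum of squares:
pcs <- prcomp(tfidf)$x[, 1:25]                 # keep the first 25 PCA coordinates
fit <- kmeans(pcs, centers = 5, nstart = 25)   # 25 random restarts; best fit kept
fit$tot.withinss                               # the quantity a good clustering minimizes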
To apply K-means, we will use the dsst_kmeans function. We need to set the number of clusters; it is also possible to change the number of PCA dimensions.
anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5L)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
## # A tibble: 25 × 4
## label v1 v2 cluster
## <chr> <dbl> <dbl> <dbl>
## 1 U01 -0.0446 -0.0618 4
## 2 U02 -0.259 0.00322 1
## 3 U03 -0.347 0.0331 1
## 4 U04 0.0204 -0.144 5
## 5 U05 0.0113 -0.327 3
## 6 U06 -0.425 -0.00215 2
## 7 U07 0.0163 -0.171 5
## 8 U08 -0.436 -0.0210 2
## 9 U09 -0.0402 -0.196 5
## 10 U10 -0.0958 -0.0912 4
## # … with 15 more rows
Notice that these correspond closely to the clusters you would see by grouping nearby points in the PCA plot from last time.
This allows us to use the data to build other visualizations:
anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5) %>%
  ggplot(aes(v1, v2, color = factor(cluster))) +
    geom_point() +
    geom_text_repel(aes(label = label), show.legend = FALSE) +
    theme_void()
Another method for clustering analysis is to use a hierarchical approach. We start by computing the nearest neighbor to each data point. Then, we combine the two points that are closest together into a single cluster. Next, we recompute the nearest neighbors using the center of the two combined points in place of the individual points. This continues until all of the data are in a single cluster.
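Before turning to the helper function, here is a rough base-R sketch of the same procedure, assuming x is a matrix of PCA coordinates (a hypothetical name). hclust() with centroid linkage merges clusters by the distance between their centers, matching the description above; note that centroid linkage expects squared Euclidean distances:
hc <- hclust(dist(x)^2, method = "centroid")   # agglomerative clustering, centroid linkage
plot(hc)                                       # the dendrogram of merges
cutree(hc, k = 10)                             # cut the tree into a concrete set of 10 clusters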
To run hierarchical clustering, we can use the dsst_hclust function:
model <- anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label", n_dims = 2) %>%
  dsst_hclust()
The hierarchy of clusters can be seen by plotting the model:
plot(model)
We can create a concrete set of clusters by picking the number of clusters or the height of the tree to cut at:
dsst_hclust_cut(model, nclust = 10)
## # A tibble: 25 × 2
## document cluster
## <chr> <dbl>
## 1 U01 1
## 2 U02 2
## 3 U03 3
## 4 U23 3
## 5 U04 4
## 6 U07 4
## 7 U11 4
## 8 U21 4
## 9 U22 4
## 10 U25 4
## # … with 15 more rows
Hierarchical clustering works well for a small number of documents but can become difficult to use when working with a larger set of documents.
In the preceding analyses, we have focused on the analysis of the documents through their usage of words. It turns out that it is also possible to apply dimensionality reduction and distance metrics by swapping the roles of the words and the documents. All of the concepts work the same, but now each word is described by how often it is used in every document.
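In matrix terms, this amounts to transposing the term-frequency matrix. A sketch with base R’s prcomp, assuming tf is a dense documents-by-terms frequency matrix (a hypothetical name):
word_pca <- prcomp(t(tf))   # after transposing, the words become the observations
head(word_pca$x[, 1:2])     # the first two principal coordinates of each word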
I included an easy-to-use argument to the dsst_pca function to compute the PCA fit of this transposed analysis (as well as UMAP):
dsst_pca(anno, invert = TRUE)
## # A tibble: 100 × 3
## doc_id v1 v2
## <chr> <dbl> <dbl>
## 1 to 0.168 -0.0313
## 2 this 0.138 0.119
## 3 in 0.145 -0.195
## 4 it 0.156 0.0813
## 5 that 0.159 0.00803
## 6 not 0.152 0.0571
## 7 have 0.142 -0.0369
## 8 with 0.147 -0.0566
## 9 on 0.127 -0.0805
## 10 I 0.153 0.157
## # … with 90 more rows
Likewise, we can compute clusters of words:
anno %>%
  filter(upos %in% c("NOUN")) %>%
  dsst_pca(n_dims = 15L, invert = TRUE) %>%
  dsst_kmeans(n_clusters = 5L) %>%
  group_by(cluster) %>%
  summarize(words = paste(doc_id, collapse = "; "))
## # A tibble: 5 × 2
## cluster words
## <dbl> <chr>
## 1 1 time; way; thing; day; love; line; world
## 2 2 song; music; work; version; hit; tune; style; pop; number; lot; c…
## 3 3 album; track; year; release; title; side
## 4 4 sound; recording; performance; one; voice; note; quality; singer;…
## 5 5 cd; band; fan; rock; guitar; bit; vocal; disc; set; melody
Do you see any other interesting patterns here?
Hopefully you are starting to see the possibilities of this kind of method. For example, let’s cluster the nouns, verbs, and adjectives:
df <- anno %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ")) %>%
  mutate(lemma = stri_paste(lemma, upos, sep = "_")) %>%
  dsst_pca(n_dims = 15L, invert = TRUE) %>%
  dsst_kmeans(n_clusters = 5L)
df
## # A tibble: 100 × 17
## doc_id v1 v2 v3 v4 v5 v6 v7 v8
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 have_… 0.180 -0.00977 0.0147 -0.0176 0.0234 0.0949 4.46e-2 -0.0163
## 2 good_… 0.133 -0.0604 0.0396 -0.0469 -0.116 0.0128 -4.27e-2 -0.177
## 3 album… 0.137 -0.195 -0.111 0.131 0.167 0.0411 -6.26e-2 -0.0997
## 4 song_… 0.164 -0.171 -0.0697 0.0335 -0.101 0.0772 -1.84e-1 0.131
## 5 music… 0.107 0.152 0.0497 -0.0278 0.212 -0.158 -1.54e-1 0.137
## 6 make_… 0.164 0.00980 -0.00268 -0.0529 0.0387 0.0329 3.73e-2 0.0589
## 7 be_VE… 0.134 0.125 0.0455 0.00135 -0.0419 0.0255 7.79e-2 0.0344
## 8 do_VE… 0.144 -0.0754 0.0707 -0.119 -0.0215 -0.0574 1.08e-1 0.0344
## 9 great… 0.117 -0.0479 -0.00495 -0.248 0.00778 0.0102 5.49e-4 -0.0778
## 10 time_… 0.150 0.00945 -0.00249 0.104 0.0473 -0.0395 1.20e-1 0.0521
## # … with 90 more rows, and 8 more variables: v9 <dbl>, v10 <dbl>,
## # v11 <dbl>, v12 <dbl>, v13 <dbl>, v14 <dbl>, v15 <dbl>, cluster <dbl>
And then plot them according to the part of speech:
df %>%
  mutate(upos = stri_extract(doc_id, regex = "[^_][A-Z]+\\Z")) %>%
  mutate(cluster = factor(cluster)) %>%
  ggplot(aes(v1, v2, color = upos)) +
    geom_text(aes(label = doc_id), show.legend = FALSE) +
    theme_void()
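As a quick follow-up to the plot, we could also tabulate how the parts of speech spread across the clusters (a sketch reusing the df object from above):
df %>%
  mutate(upos = stri_extract(doc_id, regex = "[^_][A-Z]+\\Z")) %>%
  count(cluster, upos)   # number of words of each part of speech in each cluster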