As mentioned in class, you do not need to read the notes for today ahead of time. We will discuss them together in class.

Load the Data

For today’s notes we will work with one of the data sets from Project 2, specifically the reviews of Music CDs.

docs <- read_csv("../data/amazon_cds.csv.bz2")
anno <- read_csv("../data/amazon_cds_token.csv.bz2")

Today we make a shift towards a different type of analysis. So far we have been mostly interested in a large collection of texts with the goal of associating textual features with one or more labels. Here, we start to consider the case where we are interested in understanding specific documents in a corpus.

As we talked about on the first day of class, there are three broad classes of model types. So far we have worked with supervised (i.e., predictive) learning, in which we have a relatively small set of categories, each with many examples. We wanted to find trends to predict the category based on common properties of the examples.

Now, we want to introduce methods for unsupervised learning. In unsupervised models, we are interested in each document individually and want to understand how it relates to all of the other documents. Often unsupervised learning is used as a first step in supervised learning, but it can be, and often is, used on its own.

Note: Please be patient with the notes today. I think the best approach to introducing the material starts by showing a few techniques in a way that does not have an immediate pay-off. We will get to the pay-off in a few minutes!

TF-IDF

Most of the models that we have built so far use the term frequencies (TF) to make predictions. These are counts of how often individual terms (usually lemmas, but these could be counts of parts of speech or other things) are used in each document.

As we move into unsupervised learning, we will see that it is important to scale these counts to account for the overall frequency of terms across the corpus. In other words, without the guidance of a predictive task to tell us how much to weight each term frequency, we need to apply a manual scaling method to account for this.

The scaling that we will use results in a technique called TF-IDF (term frequency-inverse document frequency). It creates features by taking a scaled version of the term frequencies and down-weighting terms according to how often they are used across the other documents. Mathematically, if tf is the number of times a term is used in a document, df is the number of documents that use the term at least once, and N is the total number of documents, the TF-IDF score can be computed as:

\[ \text{tfidf} = (1 + \log_2(\text{tf})) \times \log_2(\text{N} / \text{df}) \]

The score gives a measurement of how important a term is in describing a document in the context of the other documents. Note that this is a popular choice for the scaling functions, but they are not universal and other choices are possible.
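To make the formula concrete, here is a minimal sketch in base R (not the code used by the dsst functions, just an illustration) that applies it to a small made-up term-count matrix:

# A toy term-count matrix: 4 documents and 3 terms (made-up numbers)
tf <- matrix(c(2, 0, 1,
               1, 1, 0,
               0, 3, 0,
               4, 0, 2), ncol = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:4), c("band", "track", "melody")))

N  <- nrow(tf)                                  # total number of documents
df <- colSums(tf > 0)                           # documents containing each term

tf_weight  <- ifelse(tf > 0, 1 + log2(tf), 0)   # scaled term frequency
idf_weight <- log2(N / df)                      # inverse document frequency
tfidf      <- sweep(tf_weight, 2, idf_weight, `*`)
tfidf

Notice that a term appearing in every document gets an IDF weight of zero, so it contributes nothing to the scores no matter how often it is used.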

In addition to using TF-IDF as the first step in other unsupervised algorithms, which we will see below, we can also use TF-IDF to measure the most important words in each document by finding the terms that have the highest scores. We can do this using the dsst_tfidf function:

dsst_tfidf(anno)
## # A tibble: 6,250 × 2
##    doc_id   tokens                                 
##    <chr>    <chr>                                  
##  1 doc00001 "start; melody; cd; around; end"       
##  2 doc00002 "band; /; track; show; my"             
##  3 doc00003 "enough; track; \"; band; year"        
##  4 doc00004 "single; see; cover; classic; !"       
##  5 doc00005 "course; ;; bring; \"; track"          
##  6 doc00006 "off; single; look; line; nothing"     
##  7 doc00007 "release; band; bring; then; start"    
##  8 doc00008 "tune; release; back; feature; too"    
##  9 doc00009 "big; see; end; then; your"            
## 10 doc00010 "bring; follow; version; its; original"
## # … with 6,240 more rows

Note that we do not need to use this function itself in any of the following applications. The TF-IDF code is already embedded in the implementations of the other unsupervised tasks.

Before moving on, recall that I mentioned we used a scaled version of the term frequencies in the KNN estimator and that we would return to the concept. The TF-IDF matrix is what was being used to generate the nearest neighbours. In the next section, we will again use the TF-IDF scores to compute distances, but in a slightly different way.
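As a rough sketch of that idea (continuing the toy tfidf matrix from above; the dsst functions do this internally on the full corpus), nearest neighbours can be read off from the pairwise Euclidean distances between the TF-IDF rows:

# Euclidean distances between the TF-IDF representations of the toy documents
d <- as.matrix(dist(tfidf, method = "euclidean"))

# the nearest neighbour of each document is the closest other document
diag(d) <- Inf
apply(d, 1, which.min)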

Principal component analysis (PCA)

Now, let’s consider one of the most important types of unsupervised learning: dimensionality reduction. Here, the goal is to take data that has a large number of features and map it into a much smaller set of features. We want as much structure as possible to be retained in the new set of features; in the process, however, we will lose the interpretability of the individual features.

Principal component analysis (PCA) is a common method for taking a high-dimensional data set and converting it into a smaller set of dimensions that capture many of the most interesting aspects of the higher-dimensional space. Specifically, it tries to preserve the (Euclidean) distances between pairs of points in the smaller dimensional space. For example, if we used a two-dimensional PCA, it would try to arrange the points in a plane such that their pairwise distances are as similar as possible to those in the larger space.

An interesting trait of PCA is that the best arrangement of points in K-1 dimensions ends up being a linear subspace of the best arrangement in K dimensions. Why is this important? It means that we can define specific quantities called principal components such that the PCA mapping into a K-dimensional space is given simply by the first K principal components.
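This nesting can be seen directly with base R's prcomp function, a standard PCA implementation (a sketch on random data, not the internals of dsst_pca):

set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)   # 100 observations with 10 features

pca2 <- prcomp(X, rank. = 2)              # best 2-dimensional arrangement
pca5 <- prcomp(X, rank. = 5)              # best 5-dimensional arrangement

# the 2-dimensional scores are exactly the first two columns of the 5-dimensional ones
all.equal(pca2$x, pca5$x[, 1:2])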

We can compute principal components of the TF-IDF features using the function dsst_pca. By default we include only two components; more can be selected with the parameter n_dims.

dsst_pca(anno)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
## # A tibble: 6,250 × 3
##    doc_id        v1        v2
##    <chr>      <dbl>     <dbl>
##  1 doc00001 -0.0216  0.00684 
##  2 doc00002 -0.0233  0.00429 
##  3 doc00003 -0.0242  0.000205
##  4 doc00004 -0.0115  0.00199 
##  5 doc00005 -0.0264  0.00206 
##  6 doc00006 -0.0210  0.00837 
##  7 doc00007 -0.0199  0.00798 
##  8 doc00008 -0.0171  0.00324 
##  9 doc00009 -0.0269  0.00701 
## 10 doc00010 -0.0230 -0.00538 
## # … with 6,240 more rows

What can we do with these components? One thing we can do is plot the first two components to show the relationship between the documents within the high dimensional space:

dsst_pca(anno) %>% 
  ggplot(aes(v1, v2)) +
    geom_point()

Again, let’s hold on for a moment; we will get to why this all matters shortly.

UMAP

A similar, slightly more complex, method for dimensionality reduction is called Uniform Manifold Approximation and Projection (UMAP). I will refer you to the original paper for a formal description; in short, it tries to preserve the nearest neighbours of points while not being concerned with their specific distances. We can compute the embedding with the dsst_umap function:

anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  dsst_umap()
## # A tibble: 6,250 × 3
##    doc_id        v1    v2
##    <chr>      <dbl> <dbl>
##  1 doc00001 -5.00   -1.22
##  2 doc00002 -0.0787 -2.54
##  3 doc00003  0.570  -1.56
##  4 doc00004  1.74   -1.67
##  5 doc00005  0.812  -1.05
##  6 doc00006  0.571  -2.00
##  7 doc00007  0.804  -1.99
##  8 doc00008 -0.943  -1.40
##  9 doc00009  0.497  -2.06
## 10 doc00010  0.815  -1.45
## # … with 6,240 more rows

As with the principal components, the exact values are unimportant. It is the relationship between the documents that counts. For example, we can make a plot similar to the previous one:

anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  dsst_umap() %>% 
  ggplot(aes(v1, v2)) +
    geom_point()

Notice that the shape of the dimensionality reduction results is noticeably different from that of the PCA analysis.
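For reference, an implementation along these lines could use the uwot package, which provides UMAP in R (a hedged sketch on random data; the actual internals of dsst_umap may differ):

# assumes the uwot package is installed
library(uwot)

set.seed(1)
X <- matrix(rnorm(200 * 50), ncol = 50)            # 200 "documents", 50 features

emb <- umap(X, n_neighbors = 15, n_components = 2) # preserve 15 nearest neighbours
head(emb)                                          # one (v1, v2) pair per document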

Visualise the Embedding

The results of a dimensionality reduction algorithm such as PCA or UMAP are known as embeddings. The shapes can be interesting to look at, but they do not immediately tell us much about the space. The key is that we need a new way of visualising the output interactively, beyond what is possible directly in R. Instead, let’s save the results of the embedding as a JSON file and visualise it using a tool written in JavaScript.

To create the data to load, run the following function (note the two options that control the output):

dsst_pca(anno) %>%
  dsst_json_drep(docs, color_var = "label", title_vars = c("doc_id", "label"))

By default, it saves the result as a file named “dim_reduction.json” in your class output directory. Now, go to the Text Embedding Visualizer website and upload the JSON file that we just produced. It should create an interactive visualisation that you can use to explore the dataset.
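If you want to inspect what a file like this contains, a data frame of embedding coordinates can also be written to JSON with the jsonlite package (a rough sketch; the exact fields produced by dsst_json_drep may differ, and the file name here is hypothetical):

library(jsonlite)

# hypothetical: serialise the two PCA components to a JSON file for inspection
dsst_pca(anno) %>%
  write_json("pca_embedding.json", pretty = TRUE)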

What is a Document?

In addition to making an interactive visualisation, another way that we can make the results of unsupervised learning more interesting is to change what we consider a document. The documents in this example are, by default, individual product reviews. However, we could collect the tokens across another metadata category. For example, we could consider all of a user’s reviews to collectively be a single “document”. To change this, we need to combine the annotations data with the docs table and then change the doc_var variable in the call to the PCA function:

anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "label") %>% 
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))

We can, for example, now understand the difference in structure between the PCA and UMAP embeddings by comparing them here:

anno %>%
  filter(upos %in% c("NOUN", "ADJ", "VERB")) %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_umap(doc_var = "label") %>% 
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = label))

Notice that we start to see clusters of users in the output. How could these have been useful in your previous project?