As mentioned in class, you do not need to read the notes for today ahead of time. We will discuss them together in class. Please spend the extra time working on your Project 3.

Load Wikipedia Data

We will start by looking at a collection of Wikipedia pages related to sovereign states (i.e., countries):

docs <- read_csv(file.path("..", "data", "wiki_list_of_sovereign_states.csv"))
anno <- read_csv(file.path("..", "data", "wiki_list_of_sovereign_states_anno.csv.gz"))

Today we will use these to illustrate a new technique for understanding a large collection of documents.

Latent Dirichlet Allocation: Method

Now, we will investigate a method for topic modeling. This is an unsupervised task that seeks to identify topics within a corpus of text. What exactly is a topic? Mathematically speaking, it is usually defined as a probability distribution over a collection of words. Words that have a high probability within a topic tend to characterise the topic's theme in a colloquial sense. For example, a topic that captures the idea of baseball would have high probabilities on words such as “base”, “player”, “strike”, “umpire”, “team”, and so forth.
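To make this concrete, here is a toy representation of such a topic in R; the words and probability values are invented purely for illustration:

# a toy "baseball" topic: word probabilities that sum to one (made-up values)
topic_baseball <- c(
  base = 0.08, player = 0.07, strike = 0.05, umpire = 0.04, team = 0.04,
  other = 0.72  # probability mass spread across the rest of the lexicon
)
sum(topic_baseball)  # always equal to 1 for a valid topic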

We will use a model today called Latent Dirichlet Allocation, or more commonly LDA. Given a fixed number of topics and a fixed set of words (called a lexicon), LDA assumes that documents consist of a random collection of words constructed according to the following model (a small simulation of the process appears after the list):

  1. Each document is randomly partitioned into topics. For example, one document may be 20% in Topic A, 70% in Topic B, and 1% in each of the remaining 10 topics.
  2. Each topic is similarly assigned a probability distribution over all of the available words.
  3. When choosing words to create a document, pick a topic at random proportional to the topic distribution of the document, and then pick a word proportional to the chosen topic.
  4. The number of words in each document is assumed to be fixed, and the words within a document are assumed to be independent of one another.
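Here is a minimal R sketch of that generative story for a single document; the lexicon, topic proportions, and word probabilities are all invented for illustration:

# step 1: this document's (made-up) topic proportions
theta <- c(A = 0.7, B = 0.3)

# step 2: each topic's (made-up) distribution over a tiny lexicon
lexicon <- c("base", "player", "umpire", "election", "senate", "vote")
beta <- rbind(
  A = c(0.40, 0.30, 0.20, 0.05, 0.03, 0.02),  # mostly baseball words
  B = c(0.02, 0.03, 0.05, 0.40, 0.30, 0.20)   # mostly politics words
)

# steps 3-4: draw a fixed number of words, picking a topic for each word
set.seed(1)
doc <- replicate(10, {
  k <- sample(names(theta), size = 1, prob = theta)  # pick a topic
  sample(lexicon, size = 1, prob = beta[k, ])        # pick a word from it
})
doc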

This model is a great example of the adage that “all models are wrong, but some are useful”. Clearly, this is not how documents are constructed, and words are not independent of one another. However, the approximation is close enough to produce a useful abstraction for detecting themes within a corpus of textual documents.

You will notice that the description above is in some ways backwards from reality. It assumes that we know the distribution of the topics over the words and documents but do not know what words are present in the documents. In fact, we know the words but not the topics! This is an example of a Bayesian model. If we wrote down the assumptions rigorously, we could invert the probabilities using Bayes’ Theorem. That is, instead of knowing the probability of the documents given the topics, we can determine the probability of the topics given the documents. It is not possible to do this analytically, however, and a simulation method is needed to figure out what distribution of topics over the words and documents is most likely to have produced the observed data.
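Schematically, if we write W for the observed words and Z for the latent topic structure (notation introduced here only for illustration), the inversion is:

$$
p(Z \mid W) = \frac{p(W \mid Z) \, p(Z)}{p(W)}
$$

The denominator p(W) requires summing over every possible topic configuration, which is why the computation cannot be done analytically.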

As a final note about the method, the name comes from the standard distribution used to determine the topics (the Dirichlet distribution) and the fact that the topics themselves are never observed (that is, they are latent).

Latent Dirichlet Allocation: Application

Now, let’s actually compute an LDA topic model using the function dsst_lda_build. It requires setting the number of topics; 16 is a good starting number. I tend to use only nouns, adjectives, adverbs, and verbs for the LDA model. The model can take a few minutes to finish; it should print out results as the algorithm proceeds.

model <- anno %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
  dsst_lda_build(num_topics = 16)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

The results in the object model are structured as data tables, so you can work with them directly if you want to do some EDA on the model. For example, here are the documents most associated with each topic:

model$docs %>%
  group_by(topic) %>%
  arrange(desc(prob)) %>%
  slice_head(n = 5) %>%
  summarise(countries = paste(doc_id, collapse = "; "))
## # A tibble: 16 × 2
##    topic countries                                                          
##    <int> <chr>                                                              
##  1     1 Guinea; Equatorial Guinea; Sierra Leone; Papua New Guinea; Liberia 
##  2     2 Canada; Russia; United States; Australia; North Korea              
##  3     3 Tuvalu; Cook Islands; Saint Kitts and Nevis; Kiribati; Federated S…
##  4     4 Saudi Arabia; Oman; Egypt; Bahrain; United Arab Emirates           
##  5     5 Kazakhstan; Kuwait; Niue; Iceland; Qatar                           
##  6     6 Bolivia; Colombia; Ecuador; El Salvador; Peru                      
##  7     7 Burundi; Uganda; Central African Republic; Eritrea; South Sudan    
##  8     8 United Kingdom; Republic of Ireland; Northern Cyprus; New Zealand;…
##  9     9 Malawi; Benin; Ghana; Ivory Coast; Senegal                         
## 10    10 Vatican City; Montenegro; Monaco; Andorra; San Marino              
## 11    11 Zambia; South Africa; Nepal; Botswana; Eswatini                    
## 12    12 Norway; Kingdom of the Netherlands; Sweden; Finland; Denmark       
## 13    13 Taiwan; South Korea; Japan; China; Philippines                     
## 14    14 Czech Republic; Greece; Germany; Hungary; Italy                    
## 15    15 Lebanon; Syria; Abkhazia; Israel; State of Palestine               
## 16    16 Spain; Mexico; Dominican Republic; Guatemala; Chile
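The same table can be read the other way around. For example, here is the full topic distribution for a single page, using “Japan”, one of the doc_id values shown above:

model$docs %>%
  filter(doc_id == "Japan") %>%
  arrange(desc(prob))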

Likewise, here are the words most associated with each topic:

model$terms %>%
  group_by(topic) %>%
  arrange(desc(beta)) %>%
  slice_head(n = 5) %>%
  summarise(words = paste(token, collapse = "; "))
## # A tibble: 16 × 2
##    topic words                                            
##    <int> <chr>                                            
##  1     1 guinea; coup; african; oil; french               
##  2     2 federal; russian; worlds; nuclear; percent       
##  3     3 island; british; sea; species; colony            
##  4     4 british; oil; arab; egyptian; woman              
##  5     5 tax; income; company; km; private                
##  6     6 Portuguese; indigenous; coast; species; José     
##  7     7 child; ethnic; opposition; african; woman        
##  8     8 cent; per; Turkish; british; irish               
##  9     9 french; african; slave; index; music             
## 10    10 ethnic; parliament; border; italian; french      
## 11    11 indian; british; species; african; southern      
## 12    12 parliament; king; municipality; county; immigrant
## 13    13 chinese; japanese; asian; island; dynasty        
## 14    14 european; modern; music; german; style           
## 15    15 russian; ethnic; conflict; israeli; syrian       
## 16    16 spanish; american; italian; immigrant; indigenous
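If you prefer a graphical summary, a faceted bar plot built from the same columns works well; this sketch assumes ggplot2 is loaded along with the rest of the tidyverse:

model$terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%   # top five terms per topic
  ungroup() %>%
  ggplot(aes(x = beta, y = token)) +
    geom_col() +
    facet_wrap(~topic, scales = "free_y")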

Finally, we return to the idea of building an interactive visualisation using JavaScript. We can write all of the model data to a local JSON file with the dsst_json_lda function:

dsst_json_lda(model, docs)

The results are stored, by default, as a JSON file in the output directory of your class notes. If we go to the website Topic Model Visualizer and upload the JSON file that we just produced, it will create an interactive visualisation of the topics for us to explore.