As mentioned in class, you do not need to read the notes for today ahead of time. We will discuss them together in class. Please spend the extra time working on your Project 3.
We will start by looking at a collection of Wikipedia pages related to sovereign states (i.e., countries):
docs <- read_csv(file.path("..", "data", "wiki_list_of_sovereign_states.csv"))
anno <- read_csv(file.path("..", "data", "wiki_list_of_sovereign_states_anno.csv.gz"))
Today we will use these to illustrate a new technique for understanding a large collection of documents.
Now, we will investigate a method for topic modeling. This is an unsupervised task that seeks to identify topics within a corpus of text. What exactly is a topic? Mathematically speaking, it is usually defined as a probability distribution over a collection of words. Words that have a high probability within a topic tend to characterise the topic's theme in a colloquial sense. For example, a topic that captures the idea of baseball would have high probabilities on words such as “base”, “player”, “strike”, “umpire”, “team”, and so forth.
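To make this concrete, a topic can be represented as nothing more than a named vector of probabilities over a lexicon. The words and numbers below are invented purely for illustration:

```r
# A toy "baseball" topic: a probability distribution over a tiny lexicon.
# The specific words and probabilities here are made up for illustration.
baseball_topic <- c(
  base = 0.30, player = 0.25, strike = 0.20,
  umpire = 0.15, team = 0.05, economy = 0.05
)

# Like any probability distribution, a topic must sum to one
sum(baseball_topic)

# The highest-probability words give the topic its colloquial label
head(sort(baseball_topic, decreasing = TRUE), 3)
```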
We will use a model today called Latent Dirichlet Allocation, or more commonly LDA. Given a fixed number of topics and a fixed set of words (called a lexicon), LDA assumes that documents consist of a random collection of words constructed according to the following model: each topic is described by a probability distribution over the words in the lexicon; each document is assigned its own probability distribution over the topics; and each word in a document is generated independently by first drawing a topic from the document's topic distribution and then drawing a word from that topic's distribution over the lexicon.
This model is a great example of the adage that “all models are wrong, but some are useful”. Clearly, this is not how documents are constructed, and words are not independent of one another. However, the approximation is close enough to produce a useful abstraction for detecting themes within a corpus of textual documents.
You will notice that the description above is in some ways backwards from reality. It assumes that we know the distribution of the topics over the words and documents but do not know what words are present in the documents. In fact, we know the words but not the topics! This is an example of a Bayesian model. If we wrote down the assumptions rigorously, we could invert the probabilities using Bayes’ Theorem. That is, instead of knowing the probability of the documents given the topics, we can determine the probability of the topics given the documents. It is not possible to do this analytically, however, and a simulation method is needed to figure out what distribution of topics over the words and documents is most likely to have produced the observed data.
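Schematically (this is a gloss on the argument above, not the full formal specification), if we write $\theta$ for the unobserved topic structure and $w$ for the observed words, Bayes' Theorem inverts the generative direction:

```latex
p(\theta \mid w) \;=\; \frac{p(w \mid \theta)\, p(\theta)}{p(w)} \;\propto\; p(w \mid \theta)\, p(\theta)
```

The normalising term $p(w)$ cannot be computed analytically for LDA, which is why simulation methods such as Gibbs sampling are used to approximate the posterior distribution of the topics.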
As a final note about the method, the name comes from the standard distribution used to generate the topics (the Dirichlet distribution) and the fact that the topics themselves are never observed (that is, they are latent).
Now, let’s actually compute an LDA topic model using the function dsst_lda_build. It requires setting the number of topics; 16 is a good starting number. I tend to use only nouns, adjectives, adverbs, and verbs for the LDA model. The model can take a few minutes to finish; it should print out results as the algorithm proceeds.
model <- anno %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
  dsst_lda_build(num_topics = 16)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
The results in the object model are structured as data tables. You can use them directly if you want to do some exploratory data analysis on the output. For example, here are the documents most associated with each topic:
model$docs %>%
  group_by(topic) %>%
  arrange(desc(prob)) %>%
  slice_head(n = 5) %>%
  summarise(countries = paste(doc_id, collapse = "; "))
## # A tibble: 16 × 2
## topic countries
## <int> <chr>
## 1 1 Guinea; Equatorial Guinea; Sierra Leone; Papua New Guinea; Liberia
## 2 2 Canada; Russia; United States; Australia; North Korea
## 3 3 Tuvalu; Cook Islands; Saint Kitts and Nevis; Kiribati; Federated S…
## 4 4 Saudi Arabia; Oman; Egypt; Bahrain; United Arab Emirates
## 5 5 Kazakhstan; Kuwait; Niue; Iceland; Qatar
## 6 6 Bolivia; Colombia; Ecuador; El Salvador; Peru
## 7 7 Burundi; Uganda; Central African Republic; Eritrea; South Sudan
## 8 8 United Kingdom; Republic of Ireland; Northern Cyprus; New Zealand;…
## 9 9 Malawi; Benin; Ghana; Ivory Coast; Senegal
## 10 10 Vatican City; Montenegro; Monaco; Andorra; San Marino
## 11 11 Zambia; South Africa; Nepal; Botswana; Eswatini
## 12 12 Norway; Kingdom of the Netherlands; Sweden; Finland; Denmark
## 13 13 Taiwan; South Korea; Japan; China; Philippines
## 14 14 Czech Republic; Greece; Germany; Hungary; Italy
## 15 15 Lebanon; Syria; Abkhazia; Israel; State of Palestine
## 16 16 Spain; Mexico; Dominican Republic; Guatemala; Chile
Likewise, here are the words most associated with each topic:
model$terms %>%
  group_by(topic) %>%
  arrange(desc(beta)) %>%
  slice_head(n = 5) %>%
  summarise(words = paste(token, collapse = "; "))
## # A tibble: 16 × 2
## topic words
## <int> <chr>
## 1 1 guinea; coup; african; oil; french
## 2 2 federal; russian; worlds; nuclear; percent
## 3 3 island; british; sea; species; colony
## 4 4 british; oil; arab; egyptian; woman
## 5 5 tax; income; company; km; private
## 6 6 Portuguese; indigenous; coast; species; José
## 7 7 child; ethnic; opposition; african; woman
## 8 8 cent; per; Turkish; british; irish
## 9 9 french; african; slave; index; music
## 10 10 ethnic; parliament; border; italian; french
## 11 11 indian; british; species; african; southern
## 12 12 parliament; king; municipality; county; immigrant
## 13 13 chinese; japanese; asian; island; dynasty
## 14 14 european; modern; music; german; style
## 15 15 russian; ethnic; conflict; israeli; syrian
## 16 16 spanish; american; italian; immigrant; indigenous
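These tables also support ad hoc queries of your own. As a sketch, here is one way to pull out each document's single most probable topic, using a tiny invented stand-in for model$docs (the code above suggests it holds doc_id, topic, and prob columns):

```r
library(dplyr)

# Tiny invented stand-in for model$docs, with the columns used above
docs_tbl <- tibble(
  doc_id = c("A", "A", "B", "B"),
  topic  = c(1L, 2L, 1L, 2L),
  prob   = c(0.7, 0.3, 0.2, 0.8)
)

# Keep only the single most probable topic for each document
docs_tbl %>%
  group_by(doc_id) %>%
  arrange(desc(prob)) %>%
  slice_head(n = 1) %>%
  ungroup()
```

Documents whose top topic carries only a modest probability are spread across several topics, and can be interesting to examine in the visualisation below.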
Finally, we return again to the idea of building an interactive visualisation using JavaScript. We can write all of the model data to a local JSON file with the dsst_json_lda function:
dsst_json_lda(model, docs)
The results are, by default, stored in the output directory of your class notes as a file named “”. If we go to the Topic Model Visualizer website and upload the JSON file that we just produced, it will create an interactive visualisation of the topics for us to explore.