Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to hit the broom in the upper right-hand corner of the window. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
I have set the options message=FALSE
and
echo=FALSE
to avoid cluttering your solutions with all the
output from this code.
Today we will again look at a subset of a well-known text analysis corpus call NewsGroups-20.
<- read_csv("../data/newsgroups.csv.bz2")
docs <- read_csv("../data/newsgroups_token.csv.bz2") anno
Use K-means to cluster the categories with K equal to 5 (that’s the default) and 2 PCA dimensions (that’s also the default).
# Question 01
%>%
anno inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
dsst_kmeans(n_clusters = 5L)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
## # A tibble: 20 × 4
## label v1 v2 cluster
## <chr> <dbl> <dbl> <dbl>
## 1 alt.atheism 0.254 0.293 4
## 2 comp.graphics -0.252 0.188 1
## 3 comp.os.ms-windows.misc -0.330 0.286 1
## 4 comp.sys.ibm.pc.hardware -0.369 0.265 1
## 5 comp.sys.mac.hardware -0.321 0.216 1
## 6 comp.windows.x -0.257 0.248 1
## 7 misc.forsale -0.258 0.0577 1
## 8 rec.autos -0.0191 -0.118 2
## 9 rec.motorcycles -0.0150 -0.132 2
## 10 rec.sport.baseball 0.0372 -0.337 5
## 11 rec.sport.hockey 0.0419 -0.329 5
## 12 sci.crypt 0.0249 0.0980 2
## 13 sci.electronics -0.146 0.0430 2
## 14 sci.med 0.0552 -0.0248 2
## 15 sci.space -0.0200 -0.0580 2
## 16 soc.religion.christian 0.310 0.359 4
## 17 talk.politics.guns 0.225 0.163 3
## 18 talk.politics.mideast 0.177 0.0805 3
## 19 talk.politics.misc 0.244 0.138 3
## 20 talk.religion.misc 0.353 0.407 4
Now, plot these cluster using color on a plot with the first two principal components. Make sure to force R to treat the color as a categorical variable. Visually verify that the clusters make visual sense.
# Question 02
%>%
anno inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
dsst_kmeans(n_clusters = 5L) %>%
ggplot(aes(v1, v2)) +
geom_point(aes(color = factor(cluster)))
Now, run hierarchical clustering on the categories. Save the model as an R object.
# Question 03
<- anno %>%
model inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
dsst_hclust()
Plot the hierarchical clustering and notice what additional information it gives compared to K-means.
# Question 04
plot(model)
Now, plot the colors of the hierarchical clusters on a plot with the first two principal components (this requires a bit more code) using 5 clusters. Take note of any differences in relation to the kmeans clustering.
# Question 05
%>%
anno inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
inner_join(dsst_hclust_cut(model, nclust = 5), by = c("label" = "document")) %>%
ggplot(aes(v1, v2)) +
geom_point(aes(color = factor(cluster)))
Repeat the previous question, but this time use only 3 clusters. Take note of how the clusters are grouped together relative to the previous plot.
# Question 06
%>%
anno inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
inner_join(dsst_hclust_cut(model, nclust = 3), by = c("label" = "document")) %>%
ggplot(aes(v1, v2)) +
geom_point(aes(color = factor(cluster)))
Now, let’s do a few more involved tasks with the data using the
clusters. In the code below, group the newsgroups into 5 clusters using
kmeans, including only nouns and verbs. Save the result as a data frame
called clusters
.
# Question 07
<- anno %>%
clusters filter(upos %in% c("NOUN", "VERB")) %>%
inner_join(select(docs, -text), by = "doc_id") %>%
dsst_pca(doc_var = "label", n_dims = 2) %>%
dsst_kmeans(n_clusters = 5L)
Next, compute the 5 words most associated (according the G-scores)
with the validation set of each cluster. Again, only use nouns and
verbs. Join these top terms together using the paste
function.
# Question 08
<- docs %>%
docs_new inner_join(clusters, by = "label")
%>%
anno filter(upos %in% c("NOUN", "VERB")) %>%
dsst_metrics(docs_new, label_var = "cluster") %>%
filter(train_id == "valid") %>%
group_by(label) %>%
arrange(desc(gscore)) %>% # actually unneeded since already sorted this way
slice_head(n = 5) %>%
summarize(top_terms = paste(token, collapse = " | "))
## # A tibble: 5 × 2
## label top_terms
## <dbl> <chr>
## 1 1 game | team | play | suck | player
## 2 2 drive | have | monitor | board | card
## 3 3 _ | image | file | color | wire
## 4 4 say | people | law | church | believe
## 5 5 mission | key | encryption | request | car
Next, fit an elastic net model to predict the cluster from which each message come from. Again, only use nouns and verbs. Produce a confusion matrix of the validation data.
# Question 09
<- anno %>%
model filter(upos %in% c("NOUN", "VERB")) %>%
dsst_enet_build(docs_new, label_var = "cluster")
$docs %>%
modelfilter(train_id == "valid") %>%
select(cluster, pred_label) %>%
table()
## pred_label
## cluster 1 2 3 4 5
## 1 36 0 9 53 2
## 2 0 23 40 34 3
## 3 2 11 137 80 20
## 4 0 1 12 268 19
## 5 1 2 23 115 109
Finally, using the same model output, take the docs object returned
by the model and create a new column called new_label
that
is equal to the cluster
pasted together with the
label
. Build something similar to a confusion matrix for
the validation data, with the new_label
on the rows and the
predicted cluster (its called pred_label
) on the columns.
Does this show that certain newsgroups are to blame for a large share of
the confused terms? In what ways?
# Question 10
$docs %>%
modelfilter(train_id == "valid") %>%
mutate(new_label = paste(cluster, label)) %>%
select(new_label, pred_label) %>%
table()
## pred_label
## new_label 1 2 3 4 5
## 1 rec.sport.baseball 14 0 8 27 1
## 1 rec.sport.hockey 22 0 1 26 1
## 2 comp.sys.ibm.pc.hardware 0 14 19 15 2
## 2 comp.sys.mac.hardware 0 9 21 19 1
## 3 comp.graphics 0 1 27 19 3
## 3 comp.os.ms-windows.misc 0 4 27 15 4
## 3 comp.windows.x 0 3 32 13 2
## 3 misc.forsale 2 0 33 11 4
## 3 sci.electronics 0 3 18 22 7
## 4 alt.atheism 0 0 1 45 4
## 4 soc.religion.christian 0 0 3 44 3
## 4 talk.politics.guns 0 0 1 45 4
## 4 talk.politics.mideast 0 0 1 47 2
## 4 talk.politics.misc 0 1 4 40 5
## 4 talk.religion.misc 0 0 2 47 1
## 5 rec.autos 1 0 6 17 26
## 5 rec.motorcycles 0 0 1 13 36
## 5 sci.crypt 0 1 6 21 22
## 5 sci.med 0 1 4 36 9
## 5 sci.space 0 0 6 28 16
Produce a PCA plot of the nouns, using the option invert = TRUE. You
will want to adjust the parameter min_df
to be
0.01
.
# Question 11
%>%
anno filter(upos %in% c("NOUN")) %>%
dsst_pca(invert = TRUE, min_df = 0.01) %>%
ggplot(aes(v1, v2)) +
geom_text(aes(label = doc_id))
Produce a PCA plot of the verbs, using the option invert = TRUE. You
will want to adjust the parameter min_df
to be
0.01
.
# Question 12
%>%
anno filter(upos %in% c("VERB")) %>%
dsst_pca(invert = TRUE, min_df = 0.01) %>%
ggplot(aes(v1, v2)) +
geom_text(aes(label = doc_id))
What interesting patterns do you see in the previous two plots?