So far, we have seen how to perform supervised learning tasks; that is, how to build predictive models. Another task in machine learning is unsupervised learning, where we want to learn features of a dataset without a particular variable that we are trying to predict. There are two common classes of unsupervised learning: clustering, which groups similar observations together, and dimensionality reduction, which summarizes many variables with a small number of new ones.
In this tutorial, we focus on the task of dimensionality reduction. It will be useful for term frequency matrices and also leads nicely into next week's introduction to neural networks.
To start, let's load in a number of modules that will be useful for the tutorial:
import wiki
import iplot
import wikitext
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.decomposition
import sklearn.manifold
from sklearn.preprocessing import normalize
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3
One last time, let's make use of the list of novelists and poets dataset.
import re

data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]

data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]

nov_authors = list(set(nov_authors) - set(poe_authors))
poe_authors = list(set(poe_authors) - set(nov_authors))

links = nov_authors + poe_authors
# label each page: 0 for novelists, 1 for poets
y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))
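As a quick sanity check of the pattern used above, here is the same `re.findall` call run on a toy snippet shaped like the Wikipedia list markup (the snippet itself is made up for illustration):

```python
import re

# A made-up snippet mimicking the list markup on the Wikipedia pages:
html = ('<li><a href="/wiki/Mark_Twain" title="Mark Twain">Mark Twain</a></li>'
        '<li><a href="/wiki/Walt_Whitman" title="Walt Whitman">Walt Whitman</a></li>')

# The capture group grabs everything after "/wiki/" up to the closing quote:
print(re.findall('<li><a href="/wiki/([^"]+)"', html))
# ['Mark_Twain', 'Walt_Whitman']
```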
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)
And grab the term-frequency matrix and the list of words. We will also produce a normalized version of the term-frequency matrix in which each row (that is, each page) is scaled to have unit length. This is similar to the TF-IDF matrix we constructed several weeks ago.
tf_mat = wcorp.sparse_tf().transpose()
tf_norm = normalize(tf_mat, norm='l2', axis=1)
words = wcorp.terms()
print(tf_norm.shape)
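To see what the l2 normalization is doing, here is a small sketch on a made-up term-frequency matrix with two "documents" of very different lengths:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy term-frequency matrix: the second "document" uses the same words
# as the first, just ten times as often.
X = np.array([[3.0, 4.0, 0.0],
              [30.0, 40.0, 0.0]])

Xn = normalize(X, norm='l2', axis=1)
print(Xn)                          # both rows become [0.6, 0.8, 0.0]
print(np.linalg.norm(Xn, axis=1))  # each row now has unit length
```

After normalization the two documents look identical, which is the point: pages should be compared by which words they use, not by how long they are.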
One of the most common methods for dimensionality reduction is principal component analysis, or PCA. If you are familiar with matrix algebra, it is closely related to the singular value decomposition (SVD). Here is how to use the sklearn module to project a dataset with PCA (or, as sklearn calls this variant, TruncatedSVD):
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
embed = pca.fit_transform(tf_norm)
print(embed.shape)
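To make the connection to the SVD concrete, here is a small sketch on synthetic data (using the exact `'arpack'` solver so the comparison is deterministic) showing that TruncatedSVD reproduces a projection onto the top right-singular vectors of the matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 20)

svd = TruncatedSVD(n_components=2, algorithm='arpack', random_state=0)
embed = svd.fit_transform(X)          # shape (100, 2)

# The same projection by hand: X @ V_k, where the rows of Vt are the
# right-singular vectors from a full SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
by_hand = X @ Vt[:2].T

# The two agree column-by-column, up to an arbitrary sign flip.
for k in range(2):
    print(np.allclose(np.abs(embed[:, k]), np.abs(by_hand[:, k])))
```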
Notice that our original dataset with over 18k columns has been converted into a new dataset with only two columns. Let's visualize what these columns measure:
df = pd.DataFrame(dict(
    xval=embed[:, 0],
    yval=embed[:, 1],
    link=wcorp.meta.link,
    title=wcorp.meta.title,
    num_sections=wcorp.meta.num_sections,
    poet_novel=y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)
Notice that, without any direct knowledge of the distinction between poets and novelists, the plot naturally separates the two groups. Click on a few of the points that are clustered in the "wrong" place; what do you notice?
Coloring by the number of sections shows the other side of what is captured in the embedding:
df = pd.DataFrame(dict(
    xval=embed[:, 0],
    yval=embed[:, 1],
    link=wcorp.meta.link,
    title=wcorp.meta.title,
    num_sections=wcorp.meta.num_sections,
    poet_novel=y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='num_sections')
iplot.show(fig)
Roughly, the dimension perpendicular to the poet/novelist distinction measures how long the article is.
There also exist more complex, non-linear embeddings that can capture other features of high-dimensional data. One popular example is known as t-SNE. For computational reasons, we first need to project the data down to a moderate number of dimensions with PCA:
pca = sklearn.decomposition.TruncatedSVD(n_components=50)
embed = pca.fit_transform(tf_norm)
embed.shape
And then re-project these 50 dimensions into 2 using the non-linear technique:
tsne = sklearn.manifold.TSNE(perplexity=25, n_iter=300)
tembed = tsne.fit_transform(embed)
You can see that the results are more evenly distributed across the two-dimensional space but still capture the poet/novelist distinction.
df = pd.DataFrame(dict(
    xval=tembed[:, 0],
    yval=tembed[:, 1],
    link=wcorp.meta.link,
    title=wcorp.meta.title,
    poet_novel=y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)
I'll admit that this is not a very convincing example of why the extra work of running t-SNE is worthwhile. But there are many settings (image analysis, for example) where the results are far more interpretable.
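As a sketch of that claim, the same PCA-then-t-SNE pipeline can be run on sklearn's small built-in handwritten-digits dataset (this dataset and the parameter choices are my own illustration, not part of the tutorial's corpus); the resulting two-dimensional embedding typically shows ten well-separated clusters, one per digit:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data                      # 1797 images, each 8x8 = 64 pixels

# First compress to 50 dimensions with the linear projection...
pca = TruncatedSVD(n_components=50, random_state=0)
embed = pca.fit_transform(X)

# ...then re-project to 2 dimensions with t-SNE.
tsne = TSNE(perplexity=25, random_state=0)
tembed = tsne.fit_transform(embed)
print(tembed.shape)                  # (1797, 2)
```

Plotting `tembed` colored by `digits.target` is the quickest way to see the clusters.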