Tutorial 26: Unsupervised learning

So far, we have seen how to perform supervised learning tasks, that is, how to build predictive models for a known target variable. Another task in machine learning is unsupervised learning, where we want to learn features of a dataset without any particular target variable that we are trying to predict. There are two common classes of unsupervised learning:

  1. clustering: breaking the input data into groups; we have already seen examples of this with both network analysis and spectral clustering of the words
  2. dimensionality reduction: taking a dataset with many variables and producing a new dataset with a smaller number of variables that capture the most dominant features of the original space (a short toy sketch of this idea follows below).

In this tutorial, we focus on the task of dimensionality reduction. It will be useful for term frequency matrices and also leads nicely into next week's introduction to neural networks.
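
To make the idea of dimensionality reduction concrete before we work with real text, here is a small, self-contained toy sketch (not part of the analysis below): three highly correlated columns are compressed into a single column with sklearn's PCA class, a close relative of the method we use later in the tutorial.

# A minimal toy sketch of dimensionality reduction: three correlated columns
# are compressed into one, and the single component captures almost all of
# the variation in the original data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
z = rng.normal(size=200)
X = np.column_stack([z,
                     2 * z + rng.normal(scale=0.1, size=200),
                     -z + rng.normal(scale=0.1, size=200)])

pca_toy = PCA(n_components=1)
X_small = pca_toy.fit_transform(X)

print(X.shape, "->", X_small.shape)        # (200, 3) -> (200, 1)
print(pca_toy.explained_variance_ratio_)   # close to 1.0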

Load modules and data

To start, let's load in a number of modules that will be useful for the tutorial:

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.decomposition
import sklearn.manifold
from sklearn.preprocessing import normalize
In [2]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
In [3]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3

One last time, let's make use of the lists of American novelists and poets.

In [4]:
import re

data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]

data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]

nov_authors = list(set(nov_authors) - set(poe_authors))
poe_authors = list(set(poe_authors) - set(nov_authors))
links = nov_authors + poe_authors

y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))
In [5]:
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)

Next, grab the term-frequency matrix and the list of words. We also produce a new version of the term-frequency matrix in which each row (that is, each document's vector of term counts) is scaled to have unit length. This is similar to the TF-IDF matrix we constructed several weeks ago.

In [6]:
tf_mat = wcorp.sparse_tf().transpose()
tf_norm = normalize(tf_mat, norm='l2', axis=1)
words = wcorp.terms()

print(tf_norm.shape)
(2829, 18603)
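
As a quick sanity check (this snippet is not in the original notebook and assumes tf_norm is a scipy sparse matrix, which is what sparse_tf suggests), every document row of the normalized matrix should now have unit Euclidean length:

# Sanity check: after l2 normalization each document row should have length 1
# (a row that was entirely zero, if any exists, would stay at 0).
row_norms = np.sqrt(np.asarray(tf_norm.multiply(tf_norm).sum(axis=1))).ravel()
print(row_norms[:5])   # each entry should be (approximately) 1.0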

Principal components

One of the most common methods for dimensionality reduction is principal component analysis, or PCA. If you are familiar with matrix algebra, it is closely related to the singular value decomposition (SVD). Here is how to use the sklearn module to project a dataset with PCA (or, as sklearn calls it here, a truncated SVD):

In [7]:
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
embed = pca.fit_transform(tf_norm)
print(embed.shape)
(2829, 2)
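
To make the PCA/SVD connection concrete, here is a small optional check (not part of the original notebook) on a random toy matrix: the embedding returned by TruncatedSVD agrees, up to the usual per-column sign flip, with the left singular vectors scaled by the singular values.

# Optional check of the PCA/SVD connection on a small random matrix:
# TruncatedSVD's output matches U * Sigma from numpy's SVD up to column signs.
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 20)

svd_toy = sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='arpack')
Z_toy = svd_toy.fit_transform(X_toy)

U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)
print(np.allclose(np.abs(Z_toy), np.abs(U[:, :2] * S[:2])))   # True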

Notice that our original dataset with over 18k columns has been converted into a new dataset with only two columns. Let's visualize what these columns measure:

In [8]:
df = pd.DataFrame(dict(xval = embed[:, 0],
                       yval = embed[:, 1],
                       link = wcorp.meta.link,
                       title = wcorp.meta.title,
                       num_sections = wcorp.meta.num_sections,
                       poet_novel = y_vals))

fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)

Notice that, without any direct knowledge of the distinction between poets and novelists, the plot naturally separates the two groups. Click on a few of the points that are clustered in the "wrong" place; what do you notice?
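
One way to put a rough number on this separation (an optional aside, not in the original notebook) is to see how well a simple classifier can recover the poet/novelist label from just these two unsupervised columns:

# Optional aside: fit a logistic regression on the two-dimensional embedding
# to quantify how well it separates poets from novelists.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(embed, y_vals)
print(clf.score(embed, y_vals))   # training accuracy using only the two columns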

Coloring the same plot by the number of sections in each article shows the other side of what is captured in the embedding:

In [9]:
df = pd.DataFrame(dict(xval = embed[:, 0],
                       yval = embed[:, 1],
                       link = wcorp.meta.link,
                       title = wcorp.meta.title,
                       num_sections = wcorp.meta.num_sections,
                       poet_novel = y_vals))

fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='num_sections')
iplot.show(fig)

Roughly, the dimension perpendicular to the poet/novelist distinction measures how long the article is.
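
To see which terms drive these two directions (an optional sketch, not in the original notebook; it assumes the entries of words line up with the columns of tf_norm, which is how the term-frequency matrix was constructed), we can look at the largest loadings in each component:

# Optional sketch: the terms with the largest absolute weight in each of the
# two components. Assumes words[i] corresponds to column i of tf_norm.
for k, comp in enumerate(pca.components_):
    top = np.argsort(np.abs(comp))[::-1][:10]
    print("Component {}:".format(k), [words[i] for i in top])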

t-Distributed Stochastic Neighbor Embedding

There also exist more complex, non-linear embeddings that can capture other features of high-dimensional data. One popular example is known as t-SNE. Because t-SNE is expensive to run directly on the full term-frequency matrix, we first project the data down to an intermediate number of dimensions with PCA:

In [10]:
pca = sklearn.decomposition.TruncatedSVD(n_components=50)
In [11]:
embed = pca.fit_transform(tf_norm)
embed.shape
Out[11]:
(2829, 50)

Then, we re-project these 50 dimensions down to two using the non-linear technique:

In [12]:
tsne = sklearn.manifold.TSNE(perplexity=25, n_iter=300)
tembed = tsne.fit_transform(embed)
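
Note that t-SNE starts from a random initialization, so the layout will change somewhat from run to run. If you want a reproducible picture, one small variation on the cell above (random_state is a standard sklearn parameter) is to fix the seed:

# Optional: fix the random seed so that the t-SNE layout is reproducible.
tsne = sklearn.manifold.TSNE(perplexity=25, n_iter=300, random_state=0)
tembed = tsne.fit_transform(embed)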

You can see that the results are more evenly distributed across the two-dimensional space but still capture the poet/novelist distinction.

In [13]:
df = pd.DataFrame(dict(xval = tembed[:, 0],
                       yval = tembed[:, 1],
                       link = wcorp.meta.link,
                       title = wcorp.meta.title,
                       poet_novel = y_vals))

fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)

I'll admit, this is not a very convincing example of why the extra work of running t-SNE is worthwhile. But there are many settings (image analysis, for example) where the results are far more understandable.