## Tutorial 26: Unsupervised learning¶

So far, we have seen how to perform supervised learning tasks, that is, building predictive models. Another task in machine learning is unsupervised learning, where we want to learn features of a dataset without reference to a particular output variable that we are trying to predict. There are two common classes of unsupervised learning:

1. clustering: breaking the input data into groups; we have already seen examples of this in both network analysis and spectral clustering of the words
2. dimensionality reduction: taking a dataset with many variables and producing a new dataset with a smaller number of variables that capture the most dominant features of the original space.
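To make the distinction concrete, here is a minimal sketch, using scikit-learn on synthetic data rather than the Wikipedia corpus used below, that applies one method from each class:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: 100 points in 5 dimensions
rng = np.random.RandomState(0)
X = rng.randn(100, 5)

# Clustering: assign each point to one of 3 groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress 5 columns down to 2
X_2d = PCA(n_components=2).fit_transform(X)

print(labels.shape)  # one group label per point
print(X_2d.shape)    # same points, fewer columns
```

Clustering returns one label per observation; dimensionality reduction returns the same observations described by fewer columns.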

In this tutorial, we focus on the task of dimensionality reduction. It will be useful for term frequency matrices and also leads nicely into next week's introduction to neural networks.

To start, let's load in a number of modules that will be useful for the tutorial:

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.decomposition
import sklearn.manifold
from sklearn.preprocessing import normalize

In [2]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

In [3]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3


For one last time, let's make use of the list of novelists and poets dataset.

In [4]:
import re

data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]

data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]

nov_authors = list(set(nov_authors) - set(poe_authors))
poe_authors = list(set(poe_authors) - set(nov_authors))

y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))

In [5]:
links = nov_authors + poe_authors
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)


And grab the term-frequency matrix and the list of words. We will also produce a new version of the term-frequency matrix in which each row (that is, each document) is scaled to unit length. This plays a similar role to the TF-IDF matrix we constructed several weeks ago.

In [6]:
tf_mat = wcorp.sparse_tf().transpose()
tf_norm = normalize(tf_mat, norm='l2', axis=1)
words = wcorp.terms()

print(tf_norm.shape)

(2829, 18603)
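As a quick sanity check on what `normalize` does here, the following sketch applies the same `norm='l2', axis=1` call to a small made-up matrix and confirms that every row comes out with unit length:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy term-frequency matrix: 3 documents (rows), 4 terms (columns)
tf_toy = np.array([[3.0, 0.0, 1.0, 0.0],
                   [0.0, 2.0, 2.0, 1.0],
                   [1.0, 1.0, 0.0, 0.0]])

# Scale each row (document) to unit L2 norm
tf_toy_norm = normalize(tf_toy, norm='l2', axis=1)

print(np.linalg.norm(tf_toy_norm, axis=1))  # every row now has length 1
```

This stops long articles from dominating the analysis simply because they contain more words.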


### Principal components¶

One of the most common methods for dimensionality reduction is principal component analysis, or PCA. If you are familiar with matrix algebra, it is closely related to the singular value decomposition (SVD). The sklearn class we use here is called TruncatedSVD; it computes an SVD without first centering the data, which makes it well suited to sparse matrices such as our term-frequency matrix. Here is how to use it to project the dataset:

In [7]:
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
embed = pca.fit_transform(tf_norm)
print(embed.shape)

(2829, 2)
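One property worth knowing: the components come back ordered by singular value, largest first, so the first column of the projection captures the most dominant direction in the data. A minimal sketch on random data (the data here is made up; the attribute names are standard scikit-learn):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random stand-in for a document-term matrix: 200 rows, 30 columns
rng = np.random.RandomState(0)
X = rng.rand(200, 30)

svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

# Singular values are returned in decreasing order
print(svd.singular_values_)
print(X_2d.shape)
```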


Notice that our original dataset with over 18k columns has been converted into a new dataset with only two columns. Let's visualize what these columns measure:

In [8]:
df = pd.DataFrame(dict(xval = embed[:, 0],
                       yval = embed[:, 1],
                       title = wcorp.meta.title,
                       num_sections = wcorp.meta.num_sections,
                       poet_novel = y_vals))

fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)


Notice that, without any direct knowledge of the distinction between poets and novelists, the plot naturally separates the two groups. Click on a few of the points that are clustered in the "wrong" place; what do you notice?

Coloring by the number of sections shows the other side of what is captured in the embedding:

In [9]:
df = pd.DataFrame(dict(xval = embed[:, 0],
                       yval = embed[:, 1],
                       title = wcorp.meta.title,
                       num_sections = wcorp.meta.num_sections,
                       poet_novel = y_vals))

fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='num_sections')
iplot.show(fig)


Roughly, the dimension perpendicular to the poet/novelist distinction measures how long the article is.
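One way to interpret what each dimension measures is to inspect the component loadings: the fitted object's `components_` attribute has one row per component and one weight per original column, so sorting a row picks out the terms that drive that direction. A hedged sketch with a toy vocabulary and random matrix, since the corpus objects above are specific to this course:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical vocabulary; in the tutorial this would be wcorp.terms()
words_toy = ['poem', 'verse', 'novel', 'chapter', 'prize']

rng = np.random.RandomState(0)
X = rng.rand(50, len(words_toy))

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

# Terms with the largest weight on each component
for comp in svd.components_:
    top = np.argsort(comp)[::-1][:3]
    print([words_toy[i] for i in top])
```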

### t-Distributed Stochastic Neighbor Embedding¶

There also exist more complex, non-linear embeddings that can help capture other features of high-dimensional data. One popular example is known as t-SNE. For computational reasons, we first reduce the data with PCA to a moderate number of dimensions (here, 50) before applying t-SNE:

In [10]:
pca = sklearn.decomposition.TruncatedSVD(n_components=50)

In [11]:
embed = pca.fit_transform(tf_norm)
embed.shape

Out[11]:
(2829, 50)

And then re-project these 50 dimensions into 2 using the non-linear technique:

In [12]:
tsne = sklearn.manifold.TSNE(perplexity=25, n_iter=300)
tembed = tsne.fit_transform(embed)
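The same two-step pipeline can be sketched end-to-end on synthetic data (the sizes and parameter values here are illustrative; note that perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Random stand-in for a normalized term-frequency matrix
rng = np.random.RandomState(0)
X = rng.rand(100, 200)  # 100 samples, 200 features

# Step 1: linear reduction to a moderate dimension
X_50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# Step 2: non-linear reduction down to 2 dimensions for plotting
X_2d = TSNE(n_components=2, perplexity=25, random_state=0).fit_transform(X_50)

print(X_2d.shape)
```

The initial PCA step is a common practical trick: t-SNE scales poorly with the number of input columns, and the 50-dimensional projection already retains most of the structure.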


You can see that the results are more evenly distributed across the two-dimensional space but still capture the poet/novelist distinction.

In [13]:
df = pd.DataFrame(dict(xval = tembed[:, 0],
yval = tembed[:, 1],