So far, we have seen how to perform supervised learning tasks, that is, building
predictive models. Another task in machine learning is *unsupervised learning*,
where we want to learn features of a dataset without a particular target variable
that we are trying to predict. There are two common classes of unsupervised learning:

- clustering: breaking the input data into groups; we have already seen this from both network analysis and spectral clustering on the words
- dimensionality reduction: taking a dataset with many variables and producing a new dataset with a smaller number of variables that capture the most dominant features of the original space.

In this tutorial, we focus on the task of dimensionality reduction. It will be useful for term frequency matrices and also leads nicely into next week's introduction to neural networks.
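As a tiny, self-contained illustration of what dimensionality reduction means (the data here are synthetic, not part of this tutorial's corpus), PCA can compress two nearly redundant columns into one while keeping almost all of the variation:

```python
import numpy as np
from sklearn.decomposition import PCA

# A toy dataset: two columns that are almost perfectly correlated,
# so nearly all of the variation lies along a single direction.
rng = np.random.RandomState(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# Reduce the two columns to one; the single component should
# capture nearly all of the variance.
pca = PCA(n_components=1)
X1 = pca.fit_transform(X)
print(X1.shape)                          # (100, 1)
print(pca.explained_variance_ratio_[0])  # close to 1.0
```

The idea scales up directly: later in the tutorial we reduce thousands of columns to two in exactly the same way.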

To start, let's load in a number of modules that will be useful for the tutorial:

In [1]:

```
import wiki
import iplot
import wikitext
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.decomposition  # submodules must be imported explicitly
import sklearn.manifold
from sklearn.preprocessing import normalize
```

In [2]:

```
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
```

In [3]:

```
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3
```

One last time, let's make use of the lists of American novelists and poets.

In [4]:

```
import re
data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]
data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]
nov_authors = list(set(nov_authors) - set(poe_authors))
poe_authors = list(set(poe_authors) - set(nov_authors))
links = nov_authors + poe_authors
y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))
```

In [5]:

```
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)
```

And grab the term-frequency matrix and list of words. We will also produce a new
version of the term frequency matrix that *normalizes* each row (that is, each
document's vector of term frequencies) to have unit Euclidean length. This
is similar to the TF-IDF matrix we constructed several weeks ago.

In [6]:

```
tf_mat = wcorp.sparse_tf().transpose()
tf_norm = normalize(tf_mat, norm='l2', axis=1)
words = wcorp.terms()
print(tf_norm.shape)
```
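To see concretely what the L2 normalization does, here is a toy example (separate from the corpus data): each row is rescaled so its entries have Euclidean length one.

```python
import numpy as np
from sklearn.preprocessing import normalize

# A 3-4-5 right triangle makes the arithmetic easy to check:
# the row (3, 4) has length 5, so it becomes (0.6, 0.8).
mat = np.array([[3.0, 4.0],
                [1.0, 0.0]])
print(normalize(mat, norm='l2', axis=1))
# [[0.6 0.8]
#  [1.  0. ]]
```

Normalizing the rows keeps short and long articles comparable: only the relative mix of terms matters, not the raw counts.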

One of the most common methods for dimensionality reduction is *principal
component analysis*, or PCA. If you are familiar with matrix algebra, it is closely related
to the singular value decomposition (SVD). Here is how to use the **sklearn**
module to project a dataset with PCA (or, as sklearn calls it, a truncated SVD):

In [7]:

```
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
embed = pca.fit_transform(tf_norm)
print(embed.shape)
```
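As a sanity check on what `TruncatedSVD` computes, here is a standalone sketch on a small synthetic matrix (the sizes are made up for illustration); the fitted object also reports how much variance each component explains:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic "term frequency" matrix: 20 documents, 8 terms.
rng = np.random.RandomState(0)
X = rng.poisson(lam=2.0, size=(20, 8)).astype(float)

svd = TruncatedSVD(n_components=2, random_state=0)
embed = svd.fit_transform(X)

print(embed.shape)                    # (20, 2)
print(svd.explained_variance_ratio_)  # variance captured by each component
```

One design note: unlike classical PCA, `TruncatedSVD` does not center the data before factoring it, which is exactly why it can operate efficiently on large sparse matrices like ours.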

Notice that our original dataset with over 18k columns has been converted into a new dataset with only two columns. Let's visualize what these columns measure:

In [8]:

```
df = pd.DataFrame(dict(xval = embed[:, 0],
yval = embed[:, 1],
link = wcorp.meta.link,
title = wcorp.meta.title,
num_sections = wcorp.meta.num_sections,
poet_novel = y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)
```

Notice that, without any direct knowledge of the distinction between poets and novelists, the plot naturally separates the two groups. Click on a few of the points that are clustered in the "wrong" place; what do you notice?

Coloring by the number of sections shows the other side of what is captured in the embedding:

In [9]:

```
df = pd.DataFrame(dict(xval = embed[:, 0],
yval = embed[:, 1],
link = wcorp.meta.link,
title = wcorp.meta.title,
num_sections = wcorp.meta.num_sections,
poet_novel = y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='num_sections')
iplot.show(fig)
```

Roughly, the dimension perpendicular to the poet/novelist distinction measures how long the article is.

There also exist more complex, non-linear embeddings that can help capture
other features of high-dimensional data. One popular example is known as *t-SNE*.
For computational reasons, we first use PCA to reduce the data to a moderate
number of dimensions (here, 50):

In [10]:

```
pca = sklearn.decomposition.TruncatedSVD(n_components=50)
```

In [11]:

```
embed = pca.fit_transform(tf_norm)
embed.shape
```


And then, re-project these 50-dimensions into 2 using the non-linear technique:

In [12]:

```
tsne = sklearn.manifold.TSNE(perplexity=25, n_iter=300)
tembed = tsne.fit_transform(embed)
```
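The same two-step recipe (a linear PCA reduction, then a non-linear t-SNE reduction) can be sketched end to end on synthetic data; the blob sizes and dimensions here are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Two well-separated blobs in 30 dimensions stand in for the corpus.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 30)),
               rng.normal(8.0, 1.0, size=(25, 30))])

# Step 1: linear reduction to a moderate dimension.
X10 = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)

# Step 2: non-linear reduction to two dimensions.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init='random', random_state=0)
X2 = tsne.fit_transform(X10)
print(X2.shape)  # (50, 2)
```

The perplexity parameter roughly controls how many neighbors each point tries to stay close to; values between 5 and 50 are typical.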

You can see that the results are more evenly distributed across the two-dimensional space but still capture the poet/novelist distinction.

In [13]:

```
df = pd.DataFrame(dict(xval = tembed[:, 0],
yval = tembed[:, 1],
link = wcorp.meta.link,
title = wcorp.meta.title,
poet_novel = y_vals))
fig = iplot.create_figure(df, 'xval', 'yval', url='link', color='poet_novel')
iplot.show(fig)
```

I'll admit, this is not a very convincing example for doing the extra work of running t-SNE. But there are many instances (image analysis, for example) where the results are far more interpretable.