Today, we will learn how to build predictive models that classify textual documents by the words used in the document.

Start by loading our standard modules and make sure that everything is working as expected.

In [1]:

```
import wiki
import iplot
import wikitext
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import glmnet
```

In [2]:

```
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
```

In [3]:

```
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3
```

Our dataset here will be the pages linked to from the List of American novelists and the List of poets from the United States. Get the pages by grabbing the bulk download from my website to speed things up:

In [4]:

```
#wiki.bulk_download('novel-poem', force=True)
```

The code below constructs separate lists of novelists and poets, making sure to remove anyone on both lists. Finally, it constructs an output vector `y_vals` that is 0 for novelists and 1 for poets.

In [5]:

```
import re
data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]
data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]
both = set(nov_authors) & set(poe_authors)
nov_authors = [x for x in nov_authors if x not in both]
poe_authors = [x for x in poe_authors if x not in both]
links = nov_authors + poe_authors
y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))
```
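The `re.findall` pattern above pulls the wiki slugs out of the list items; on a toy snippet (hypothetical HTML shaped like the Wikipedia list markup, not the real page) it behaves like this:

```python
import re

# Hypothetical fragment shaped like the Wikipedia list markup.
html = ('<ul><li><a href="/wiki/Mark_Twain">Mark Twain</a></li>'
        '<li><a href="/wiki/Emily_Dickinson">Emily Dickinson</a></li></ul>')

# Capture everything between /wiki/ and the closing quote of the href.
slugs = re.findall('<li><a href="/wiki/([^"]+)"', html)
```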

Finally, create a `wcorp` object to wrap up all of the information we need for our analysis.

In [6]:

```
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)
```

Recall that the `WikiCorpus` object has a function for returning the term frequency matrix. Here, we grab the sparse version of the matrix because it is much smaller and can be passed directly to most sklearn algorithms. After the transpose, rows correspond to pages and columns to terms; it should have over 18k columns:

In [7]:

```
tf_mat = wcorp.sparse_tf().transpose()
tf_mat.shape
```

Out[7]:

Also, it will be useful to grab the names of the words in each column (here, we print out the first 100 terms):

In [8]:

```
words = wcorp.terms()
words[:100]
```

Out[8]:
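The terms list lines up with the matrix columns, so a given word's counts can be pulled out by its index. A minimal sketch on a toy matrix (not the real corpus):

```python
import numpy as np
from scipy import sparse

# Toy term-frequency matrix: rows are documents, columns are terms.
words = ['poem', 'novel', 'verse']
tf = sparse.csr_matrix(np.array([[3, 0, 1],
                                 [0, 2, 0]]))

# Counts of 'verse' across the two documents, via its column index.
verse_counts = tf[:, words.index('verse')].toarray().ravel()
```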

Now, consider using the matrix `tf_mat` in a predictive model. Here it has 18k+ columns; in general, it is impossible to learn 18k parameters (as in a linear regression) with only 2800 observations. We need a method that is able to handle such models.
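To see why ordinary methods break down when columns outnumber rows, note that least squares can then fit any response perfectly, even pure noise. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 100)   # 20 observations, 100 features
y = rng.randn(20)        # response is pure noise

# The minimum-norm least squares solution still drives the
# training residuals to (numerically) zero.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
max_resid = np.max(np.abs(X @ beta - y))
```

A perfect fit to noise means the fitted coefficients carry no generalizable information; regularization is what makes the problem well-posed again.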

Consider a simple linear regression model. We have mentioned that the ordinary least squares estimator is defined by minimizing the sum of squared residuals:

$$ \text{LEAST SQUARES} \rightarrow \arg\min_{a, b} \left\{ \sum_i \left( y_i - a - b \cdot x_i \right)^2 \right\}$$

The lasso estimator modifies this slightly by adding a *penalty* term that entices
the model to make the slope parameter smaller:

$$ \text{LASSO} \rightarrow \arg\min_{a, b} \left\{ \sum_i \left( y_i - a - b \cdot x_i \right)^2 + \lambda \cdot | b | \right\}$$

For multivariate data, this becomes (for those familiar with vector norms):

$$ \text{LASSO} \rightarrow \arg\min_{\beta} \left\{ || y - X \beta ||_2^2 + \lambda \cdot || \beta ||_1 \right\}$$

And finally, the elastic net is given by:

$$ \text{ELASTIC NET} \rightarrow \arg\min_{\beta} \left\{ || y - X \beta ||_2^2 + \lambda \cdot \rho \cdot || \beta ||_1 + \lambda \cdot (1 - \rho) \cdot || \beta ||_2^2 \right\}$$

The details for us in this course are not important; what should be taken away is that we have a model that forces slope parameters to be zero unless they are particularly useful in the prediction task. It turns out that this is particularly useful for text prediction.
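As a sanity check on that claim, here is a minimal sketch (synthetic data, not the Wikipedia corpus) showing the lasso zeroing out most coefficients when only a few features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(40, 100)            # 40 observations, 100 features
beta = np.zeros(100)
beta[:3] = [4.0, -3.0, 2.0]       # only three features are predictive
y = X @ beta + 0.1 * rng.randn(40)

# The L1 penalty forces most coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))
```

Most of the 100 coefficients come out exactly zero, while the truly predictive ones survive with the correct signs.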

We can create a logistic elastic net according to the same approach used in other sklearn estimators:

In [9]:

```
lnet = glmnet.LogitNet()
lnet
```

Out[9]:

And, as with other estimators, we fit the data using the `fit` method:

In [10]:

```
lnet.fit(tf_mat, y_vals)
```

Out[10]:

As well as construct predictions using the `predict` method:

In [11]:

```
y_pred = lnet.predict(tf_mat)
```

And see how well it performs:

In [12]:

```
sklearn.metrics.accuracy_score(y_vals, y_pred)
```

Out[12]:
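Note that the accuracy above is measured on the same rows used for fitting, so it is optimistic. A held-out split gives a fairer estimate; a sketch with synthetic data, using sklearn's `LogisticRegression` as a stand-in estimator:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for tf_mat and y_vals.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
test_acc = accuracy_score(y_test, clf.predict(X_test))
```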

More than the model itself, though, the most interesting thing about the elastic net is seeing which variables were chosen by the algorithm. To start, it is helpful to build a list that matches each word to its coefficient:

In [13]:

```
vals = list(zip(words, lnet.coef_[0, :]))
vals[:10]
```

Out[13]:

And then, sort the results by coefficient and show the non-zero values.

In [14]:

```
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```

Remember that poets are coded as 1's and novelists as 0's; so positive terms are correlated with poets and negatives are novelists. Do the results make sense to you?

The value for $\lambda$ in the elastic net is chosen by trying up to 100 values and using a technique for determining which one is best. Sometimes it is also useful to look at non-optimal values, for example if the optimal output contains too many or too few terms to understand the structure of the data. Here, we grab the coefficients at the 27th largest value of the tuning parameter (index 26 along the path):

In [15]:

```
vals = list(zip(words, lnet.coef_path_[0, :, 26]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```

Do the new values make sense to you? Do any seem superfluous?
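The general pattern behind that experiment: larger values of $\lambda$ keep fewer terms in the model. A sketch of the regularization path on synthetic data, using sklearn's `lasso_path` rather than glmnet:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(0)
X = rng.randn(60, 30)
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(60)

# coefs has shape (n_features, n_alphas); alphas are returned in
# decreasing order, so the first column is the most heavily penalized.
alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
n_active = (coefs != 0).sum(axis=0)
```

Walking down the path from the largest penalty, terms enter the model one by one, which is exactly what makes the non-optimal fits interpretable.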

Let's use the same dataset with a different response variable: whether the page has been translated into German. Note that positive values are associated with translated pages and negative values with untranslated ones.

In [16]:

```
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=int)
lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)
vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```
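The first line of that cell converts a list of per-page language codes into a 0/1 vector; on hypothetical metadata (a stand-in for `wcorp.meta['langs']`):

```python
import numpy as np

# Hypothetical language-code lists for four pages.
langs = [['de', 'fr'], ['fr'], ['de', 'zh'], []]

# 1 when a German ('de') version exists, 0 otherwise.
lan_version = np.array(['de' in x for x in langs], dtype=int)
```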

Similarly, for French:

In [17]:

```
lan_version = np.array(['fr' in x for x in wcorp.meta['langs']], dtype=int)
lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)
vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```

And once more, in Chinese:

In [18]:

```
lan_version = np.array(['zh' in x for x in wcorp.meta['langs']], dtype=int)
lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)
vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```

What patterns do you see in the data here? **Does it tell you anything about the
nature of Wikipedia?**

I also found predicting whether a page has more than 2 images to be similarly interesting:

In [19]:

```
image_flag = wcorp.meta['num_images'].values > 2
lnet = glmnet.LogitNet()
lnet.fit(tf_mat, image_flag)
vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
```

**Any takeaways from this set of words?**