Tutorial 25: Predicting with Words — The Elastic Net

Today, we will learn how to build predictive models that classify textual documents based on the words they contain.

Loading modules

Start by loading our standard modules and make sure that everything is working as expected.

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
import matplotlib.pyplot as plt
import sklearn

import glmnet
Loading BokehJS ...
In [2]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
In [3]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3

Dataset

Our dataset here will be the pages linked to from the List of American novelists and List of poets from the United States. Get the pages by grabbing the bulk download from my website to speed things up:

In [4]:
#wiki.bulk_download('novel-poem', force=True)

The code below constructs separate lists of novelists and poets, removing the overlap so that no author appears in both lists. Finally, it constructs an output vector y_vals that is 0 for novelists and 1 for poets.

In [5]:
import re

# Grab the rendered HTML of each list page and pull out the linked page names.
data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
# Truncate at the last name on the alphabetical list to drop unrelated links further down the page.
nov_authors = authors[:(authors.index('Leane_Zugsmith') + 1)]

data = wiki.get_wiki_json("List_of_poets_from_the_United_States")
data_html = data['text']['*']
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
poe_authors = authors[:(authors.index('Louis_Zukofsky') + 1)]

# Remove the overlap so that no author appears in both lists.
nov_authors = list(set(nov_authors) - set(poe_authors))
poe_authors = list(set(poe_authors) - set(nov_authors))
links = nov_authors + poe_authors

# Response vector: 0 for novelists, 1 for poets.
y_vals = np.array([0] * len(nov_authors) + [1] * len(poe_authors))

Finally, create a WikiCorpus object, wcorp, that wraps up all of the information we need for our analysis.

In [6]:
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)

Textual training data

Recall that the WikiCorpus object has a function for returning the term frequency matrix. We grab the sparse version of the matrix because it is much smaller and can be passed directly to most sklearn algorithms. After transposing so that each row is a document and each column is a term, it should have over 18k columns:

In [7]:
tf_mat = wcorp.sparse_tf().transpose()
tf_mat.shape
Out[7]:
(2829, 18603)
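If you are curious just how sparse this matrix is, you can compare the number of stored entries with the full size of the matrix. This is a small optional check, assuming sparse_tf returns a scipy sparse matrix (which its sklearn compatibility suggests):

# Fraction of entries that are actually stored (non-zero)
tf_mat.nnz / (tf_mat.shape[0] * tf_mat.shape[1])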

Also, it will be useful to grab the names of the words in each column (here, we print out the first 100 terms):

In [8]:
words = wcorp.terms()
words[:100]
Out[8]:
array(['1850', '1867', '1869', '1870s', '1878', '1885', '1886', '1890',
       '1922', '24', '31', 'accounts', 'aldrich', 'appalachia',
       'appalachian', 'appleton', 'atlantic', 'attended', 'bailey',
       'battle', 'begun', 'bret', 'buried', 'celebrated', 'cemetery',
       'characters', 'charles', 'childhood', 'citation', 'civil',
       'closely', 'colonel', 'color', 'compared', 'considered', 'contact',
       'contributing', 'cotton', 'creating', 'cumberland', 'death',
       'editor', 'eliot', 'evergreen', 'fact', 'family', 'father',
       'favorably', 'female', 'fiction', 'fifteen', 'finishing', 'fought',
       'george', 'grandfather', 'great', 'hardy', 'harte', 'institute',
       'january', 'jewett', 'journal', 'july', 'knoxville', 'lawyer',
       'literature', 'lived', 'local', 'location', 'louis', 'mary',
       'monthly', 'mountain', 'mountains', 'murfreesboro', 'named',
       'nashville', 'necessity', 'needed', 'negative', 'novel', 'novels',
       'number', 'opportunity', 'orne', 'pen', 'people', 'philadelphia',
       'plantation', 'post', 'reading', 'realism', 'region', 'reinforce',
       'resort', 'resorts', 'returning', 'revolutionary', 'sarah',
       'school'], dtype='<U18')

Now, consider using the matrix tf_mat in a predictive model. It has over 18k columns; in general, we cannot estimate 18k slope parameters (as in an ordinary linear regression) from only about 2,800 observations. We need a method designed to handle such models.
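As a quick illustration of the problem, here is a small synthetic sketch (fake data, not our corpus): when a data matrix has more columns than rows, least squares can reproduce the training responses exactly and the coefficients are not uniquely determined.

import numpy as np

# 5 observations but 20 predictors: the least squares problem is underdetermined.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)        # at most 5, so the 20 coefficients cannot all be pinned down
print(residuals)   # empty array: the training responses are reproduced exactly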

Elastic net

Consider a simple linear regression model. We have mentioned that the ordinary least squares estimator is defined by minimizing the sum of squared residuals:

$$ \text{LEAST SQUARES} \rightarrow \arg\min_{a, b} \left\{ \sum_i \left( y_i - a - b \cdot x_i \right)^2 \right\}$$

The lasso estimator modifies this slightly by adding a penalty term that entices the model to make the slope parameter smaller:

$$ \text{LASSO} \rightarrow \arg\min_{a, b} \left\{ \sum_i \left( y_i - a - b \cdot x_i \right)^2 + \lambda \cdot | b | \right\}$$

For multivariate data, this becomes (for those familiar with vector norms):

$$ \text{LASSO} \rightarrow \arg\min_{\beta} \left\{ || y - X \beta ||_2^2 + \lambda \cdot || \beta ||_1 \right\}$$

And finally, the elastic net is given by:

$$ \text{ELASTIC NET} \rightarrow \arg\min_{\beta} \left\{ || y - X \beta ||_2^2 + \lambda \cdot \rho \cdot || \beta ||_1 + \lambda \cdot (1 - \rho) \cdot || \beta ||_2^2 \right\}$$

The details are not important for this course; the takeaway is that we have a model that forces slope parameters to be exactly zero unless they are particularly useful in the prediction task. This behaviour turns out to be very well suited to text prediction.
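To see this sparsity in action before turning to the text data, here is a minimal synthetic sketch. It uses sklearn's Lasso as a stand-in for the penalty described above (not the glmnet package we use next) on fake data in which only two of fifty predictors actually matter:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 observations, 50 predictors, only two of which drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(np.sum(model.coef_ != 0))   # only a handful of slopes remain non-zero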

We can create a logistic elastic net using the same approach as other sklearn estimators:

In [9]:
lnet = glmnet.LogitNet()
lnet
Out[9]:
LogitNet(alpha=1, cut_point=1.0, fit_intercept=True, lambda_path=None,
     max_iter=100000, min_lambda_ratio=0.0001, n_jobs=1, n_lambda=100,
     n_splits=3, random_state=None, scoring=None, standardize=True,
     tol=1e-07, verbose=False)

And, as with other estimators, we fit the data using the fit method:

In [10]:
lnet.fit(tf_mat, y_vals)
Out[10]:
LogitNet(alpha=1, cut_point=1.0, fit_intercept=True, lambda_path=None,
     max_iter=100000, min_lambda_ratio=0.0001, n_jobs=1, n_lambda=100,
     n_splits=3, random_state=None, scoring=None, standardize=True,
     tol=1e-07, verbose=False)

As well as constructing predictions using the predict function:

In [11]:
y_pred = lnet.predict(tf_mat)

And see how well it performs on the training data:

In [12]:
sklearn.metrics.accuracy_score(y_vals, y_pred)
Out[12]:
0.9264757864969954
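Accuracy alone can hide asymmetric mistakes between the two classes. For a slightly fuller picture, an optional addition is a confusion matrix, computed here with sklearn.metrics.confusion_matrix on the same training predictions:

# Rows are the true labels (0 = novelist, 1 = poet); columns are the predicted labels.
sklearn.metrics.confusion_matrix(y_vals, y_pred)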

Looking at the selected parameters

More interesting than the model's predictions, though, is seeing which variables were chosen by the algorithm. To start, it is helpful to build a list that matches each word to its coefficient:

In [13]:
vals = list(zip(words, lnet.coef_[0, :]))
vals[:10]
Out[13]:
[('1850', 0.0),
 ('1867', 0.0),
 ('1869', 0.0),
 ('1870s', 0.0),
 ('1878', 0.0),
 ('1885', 0.0),
 ('1886', 0.0),
 ('1890', 0.0),
 ('1922', 0.0),
 ('24', 0.0)]

And then, sort the results by coefficient and show the non-zero values.

In [14]:
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
poet            =>     0.79
poetry          =>     0.33
fiction         =>    -0.00
author          =>    -0.01
novel           =>    -0.13
novels          =>    -0.19
novelist        =>    -0.50

Remember that poets are coded as 1's and novelists as 0's, so positive terms are associated with poets and negative terms with novelists. Do the results make sense to you?
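As a quick sanity check on the class balance, which is useful context for the accuracy reported above, we can count how many pages carry each label; a minimal sketch:

# Number of novelists (label 0) and poets (label 1)
np.bincount(y_vals)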

The value of $\lambda$ in the elastic net is chosen by trying up to 100 values and using cross-validation to determine which one performs best. Sometimes it is also useful to look at non-optimal values, for example when the optimal output contains too many or too few terms to understand the structure of the data. Here, we grab the 27th value (index 26) along the path of tuning parameters:

In [15]:
vals = list(zip(words, lnet.coef_path_[0, :, 26]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
139             =>     1.06
poet            =>     0.86
poetry          =>     0.37
poems           =>     0.00
fiction         =>    -0.01
author          =>    -0.03
starring        =>    -0.03
novel           =>    -0.14
novels          =>    -0.21
novelist        =>    -0.55

Do the new values make sense to you? Do any seem superfluous?
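If you want to see which value of $\lambda$ cross-validation actually selected, and how it compares with the one we indexed by hand, the fitted object stores the full path. The attribute names below (lambda_path_ and lambda_best_) come from the python-glmnet package; double-check them against your installed version:

print(lnet.lambda_path_.shape)   # all candidate values of the tuning parameter
print(lnet.lambda_path_[26])     # the value we used above
print(lnet.lambda_best_)         # the value selected by cross-validation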

Another application

Let's use the same dataset with a different response variable: whether the page has been translated into German (that is, whether a German-language version of the page exists). Note that positive terms are associated with translated pages and negative terms with untranslated ones.

In [16]:
# 1 if a German-language version of the page exists, 0 otherwise
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=int)

lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)

vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
married         =>     0.15
pulitzer        =>     0.13
wrote           =>     0.04
novels          =>     0.04
novel           =>     0.03
years           =>     0.03
time            =>     0.02
best            =>     0.02
short           =>     0.01
adapted         =>     0.01
film            =>     0.00
won             =>     0.00

Similarly, for French:

In [17]:
# 1 if a French-language version of the page exists, 0 otherwise
lan_version = np.array(['fr' in x for x in wcorp.meta['langs']], dtype=int)

lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)

vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
married         =>     0.11
won             =>     0.07
novels          =>     0.06
adapted         =>     0.06
wrote           =>     0.05
time            =>     0.04
novel           =>     0.04
short           =>     0.02
best            =>     0.01
years           =>     0.01
film            =>     0.01
fiction         =>     0.01
father          =>     0.00

And once more, in Chinese:

In [18]:
# 1 if a Chinese-language version of the page exists, 0 otherwise
lan_version = np.array(['zh' in x for x in wcorp.meta['langs']], dtype=int)

lnet = glmnet.LogitNet()
lnet.fit(tf_mat, lan_version)

vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
years           =>     0.08
time            =>     0.04
wrote           =>     0.04
father          =>     0.03
best            =>     0.02
life            =>     0.02
short           =>     0.01
writer          =>     0.01
novel           =>     0.01
novels          =>     0.01
writing         =>     0.00
won             =>     0.00
awarded         =>     0.00
written         =>     0.00

What patterns do you see in the data here? Does it tell you anything about the nature of Wikipedia?

I also found predicting whether a page has more than 2 images to be similarly interesting:

In [19]:
# True when the page contains more than two images
image_flag = wcorp.meta['num_images'].values > 2

lnet = glmnet.LogitNet()
lnet.fit(tf_mat, image_flag)

vals = list(zip(words, lnet.coef_[0, :]))
for x in sorted(vals, key=lambda x: x[1], reverse=True):
    if x[1] != 0:
        print("{0:15s} => {1: 8.2f}".format(x[0], x[1]))
cemetery        =>     0.33
1866            =>     0.22
1886            =>     0.15
famous          =>     0.10
buried          =>     0.10
married         =>     0.08
entered         =>     0.08
semi            =>     0.07
wrote           =>     0.06
1873            =>     0.06
died            =>     0.04
public          =>     0.04
authors         =>     0.04
father          =>     0.04
children        =>     0.04
called          =>     0.03
war             =>     0.03
years           =>     0.03
1907            =>     0.03
life            =>     0.02
presented       =>     0.02
works           =>     0.02
needed          =>     0.02
multiple        =>     0.02
century         =>     0.02
well            =>     0.01
received        =>     0.01
earth           =>     0.01
age             =>     0.01
included        =>     0.01
best            =>     0.01
elected         =>     0.01
series          =>     0.01
early           =>     0.01
fiction         =>     0.00
death           =>     0.00
literature      =>     0.00
author          =>     0.00
school          =>     0.00
press           =>    -0.01

Any takeaways from this set of words?