## Tutorial 24: Machine Learning in Python with scikit-learn

Today we will replace our custom code for building predictive models with scikit-learn (imported as sklearn), a popular module for many machine learning tasks. You'll find that most well-known predictive models are implemented in the module, and many other packages reuse the same interfaces when implementing new models.

We will make use of the three modules from class, as well as numpy, matplotlib, and sklearn:

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
import matplotlib.pyplot as plt
import sklearn
# explicitly import the submodules used below; a bare "import sklearn"
# does not load them
import sklearn.linear_model
import sklearn.metrics

In [2]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3


### Get some data

For today, we will once again take links from the "important publications in philosophy" page to build a corpus for prediction. We will make a WikiCorpus object to simplify the computation of metrics for the page. Below I have removed two pages that give our Windows users some trouble.

In [3]:
np.random.seed(0)

In [4]:
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)


As before, we now extract the number of internal links on each page, whether the page is translated into German ('de'), and five predictor variables that we will try to use in constructing our models.

In [5]:
num_ilinks = wcorp.meta['num_ilinks'].values
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=int)

num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])


### Predictor matrix

Last time we somewhat awkwardly kept the five predictor variables separate from one another. In my solutions for Tutorial 23, I put them all into a list, but this was still somewhat clunky. Now, let's combine them together using numpy into a single matrix of numbers:

In [6]:
x = np.stack([num_sections, num_images, num_elinks, num_langs, num_chars], axis=1)
x

Out[6]:
array([[   20,     7,    42,    38, 22995],
[    4,     9,    16,     5,  1536],
[   26,     5,     0,     8,  1646],
...,
[    5,     1,     1,    23,  5486],
[   11,     8,     7,     2, 12488],
[   21,     2,    28,     1, 21739]])

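The axis=1 argument is what makes each variable a column. Here is a minimal sketch, using two small made-up vectors rather than the wiki data, showing how the axis argument changes the result:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# axis=1: each input becomes a column (one row per sample)
cols = np.stack([a, b], axis=1)   # shape (3, 2)

# axis=0: each input becomes a row instead
rows = np.stack([a, b], axis=0)   # shape (2, 3)
```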
This two-dimensional array of numbers should remind you of the term-frequency matrix that I showed when introducing the gensim module. The array is arranged so that it has one row for each sample of data and one column for each variable. There are a few helpful things to know about how numpy arrays work. First, they have a shape attribute that gives the dimensions of the array. Here, we have 734 rows and 5 columns:

In [7]:
x.shape

Out[7]:
(734, 5)

Also, to select a subset of the matrix we use slicing notation similar to what we saw in Tutorials 4 and 9. However, now we need notation describing both the rows and the columns, separated by a comma. A colon indicates that all rows/columns should be taken. Look at these examples:

In [8]:
x[0, :] # the first row, all columns

Out[8]:
array([   20,     7,    42,    38, 22995])
In [9]:
x[:3, :] # the first 3 rows and all columns

Out[9]:
array([[   20,     7,    42,    38, 22995],
[    4,     9,    16,     5,  1536],
[   26,     5,     0,     8,  1646]])
In [10]:
x[:2, :2] # the first 2 rows and first 2 columns

Out[10]:
array([[20,  7],
[ 4,  9]])

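A few more slicing patterns are worth knowing; this quick sketch uses a small made-up matrix rather than x:

```python
import numpy as np

m = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns

second_col = m[:, 1]    # all rows, a single column (returns a 1-D array)
last_row = m[-1, :]     # negative indices count from the end
sub = m[1:3, ::2]       # rows 1-2, every other column
```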
The slicing notation will be useful as we construct models from the data. For example, let's now create the training and testing responses and matrices:

In [11]:
y_train = num_ilinks[:325]
x_train = x[:325, :]
x_test  = x[325:, :]


The first 325 observations constitute the "training" set and the remaining 409 observations form the "testing" set. We will use these throughout the tutorial.

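As an aside, sklearn also provides a helper, train_test_split, that performs a randomized split instead of slicing by position. A minimal sketch using a small made-up matrix in place of our data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data standing in for x and num_ilinks
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# hold out half of the rows for testing; random_state makes it repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)
```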
### Using sklearn

Now, on to actually using the sklearn module. We will see first how to build a linear regression, but the nice thing about sklearn is that a similar set of steps can be used to apply almost any algorithm to the data.

Start by constructing an instance of the model you want to build:

In [12]:
reg = sklearn.linear_model.LinearRegression()
reg

Out[12]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Printing out the model, as above, is not necessary, but it does show us all of the input choices available for our model.

Next, call the fit method with the training data (the training matrix x followed by the response y) to learn the parameters of the model:

In [13]:
reg.fit(x_train, y_train)

Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Notice that this modifies the model reg in place, so we do not need to save the result. That is, sklearn uses an object-oriented design.

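Because fit modifies the estimator in place and returns the estimator itself, the calls can even be chained. A small sketch on made-up data with an exact linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y = 2*x + 1 exactly
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 1

# fit returns self, so we can chain straight into predict
pred = LinearRegression().fit(x, y).predict(x)
```

On this noiseless data the fitted line is exact, so pred matches y to floating-point precision.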
Now, if we call the predict method on a data matrix, predictions from the model are returned as a numpy array:

In [14]:
reg.predict(x_train)

Out[14]:
array([ 459.69369344,  267.0394738 ,  298.18158723,  334.28716325,
287.61232893,  909.93387046,  169.96103868,  288.45698427,
220.27976847,  312.81556344,  160.93476764,  253.3579743 ,
474.30259278,  222.82417467,  555.93542416,  711.85533888,
390.58032236,  153.91555085,  368.38886173,  751.4005149 ,
388.66635076,  212.02411094,  283.00334786,  167.42525833,
743.79858778,  176.59337022,  344.9619415 ,  414.32787335,
293.69821472,  226.64519057,  160.04164107,  658.16618618,
165.50698535,  245.75619923,  384.61681313,  626.55291787,
576.56390853,  316.77482425,  283.88517226,  368.84065429,
409.57355253,  146.95944244,  228.20014591,  242.07500088,
275.01499055,  103.91516218,  536.07803137,  429.25492705,
404.16515181,  258.86427816,  549.63950403, 1700.04885203,
1567.21457878,  498.56578449, 1368.77762545,  298.06874763,
191.73862501,  193.27732807,  335.53958739,  182.19648772,
140.48944255,  403.94733101,  447.17070538,  359.82795216,
394.78407141,  223.73583917,  348.78640878,  255.44874759,
211.20783711, 1038.61151621,  179.61125295,  262.39117185,
319.22101973,  413.3510451 ,  221.45523586,  192.08905515,
333.47373239,  285.09747656,  300.34836061,   98.31736048,
1007.90195087,  160.52289852,  205.86405145,  667.31963717,
240.71936334,  485.60323144,  637.88927626,  218.62805132,
654.01949509,  622.56117272,  341.48773544,  146.96549451,
185.92380465, 1026.26227462,  908.72077099,  986.59958202,
361.7245427 ,  326.95355457,  153.88708866,  994.58449165,
431.04678827,  319.50893333,  412.00410588,  183.65870031,
117.91879574,  254.22843854,  775.06089495,  120.77245718,
877.15136634,  341.00857192,  392.24702004,  515.92704663,
174.32437193, 1419.2978601 ,  169.83986941,  318.40534686,
210.1257942 ,  399.85803908,  533.46219672, 1102.89333325,
1385.06909729, 1278.36237359,  645.40013668,  244.11040973,
377.62868255,  626.47486675,  496.59607769,  764.6791595 ,
719.43080397,  116.60987453,  267.06347548,  251.01592776,
374.12203625,  365.91677926, 1140.6844781 ,  354.51243976,
272.81444639,  408.25211145,  617.96991213,  264.51790919,
182.33710417,  200.92518217,  432.6819557 ,  548.75896658,
528.33614372,  202.56748295,  426.75198704,  168.27302535,
843.92038348, 1079.73195243,  357.30135874,  467.58524472,
324.17982412,  915.57855039,  146.4519328 ,  159.94928939,
358.38545163,  398.35071678,  812.88029187,  237.83715123,
393.58118644,  248.43909643,  159.69653658,  236.29688561,
227.83564463, 1389.15912354,  229.89777622,  780.87467358,
595.79341614,  632.49217799,  351.0946228 , 1046.49641019,
333.47011421,  506.11634864,  635.6430656 ,  180.47194938,
993.73705547,  619.25297365,  498.11668959,  247.14191332,
281.10747355,  660.77563398,  782.9418734 ,  859.19234477,
512.2424625 , 1212.30354527,  162.45536977,  163.32359485,
365.24140383,  411.27037857, 1524.08029565,  303.89463449,
292.71264631,  407.8003204 ,  349.73814607,  204.47215541,
255.29236869,  892.21658755,  162.10406889,  363.39350507,
206.57912985,  162.35800836,  245.13417999,  246.097987  ,
452.29860456, 1358.80092605,  346.21333883,  336.55057767,
259.87043712,  510.79883685,  789.13789737,  193.07713657,
151.46136776,  628.16361133,  155.89319876, 1173.7141839 ,
354.31814873,  260.58154006,  179.29725997,  506.09786604,
604.21548518, 1472.94125513,  202.50445159,  400.89780404,
149.72564881,  268.95954296,  387.40821595,  219.43098269,
146.33436153,  204.83751177,  193.6471401 ,  666.21694988,
154.85759157,  496.39034136,  339.53014626,  639.02191356,
291.93487417,  420.82312146,  388.78716528,  174.39617181,
728.02003197,  576.95504471,  216.53231772,  131.2407959 ,
452.42670218,  459.34523211,  194.71384065,  621.68946595,
454.90100714,  770.55423579,  161.24650644,  221.3372912 ,
152.04191589,  185.71409326,  277.14817934,  192.56493698,
261.25900794,  452.16028087,  217.18903031, 1294.17428306,
233.46908775,  392.50025719,  242.18001932,  803.42638456,
177.95353082,  428.97755153,  267.24233111,  336.95820956,
1230.6973222 ,  246.15388271, 1017.94597395,  209.1812424 ,
426.11934634,  198.49556409,  231.72907569,  154.82599403,
787.21876445,  541.14417301,  344.39365995,  283.62045218,
323.33205526,  233.6664513 ,  672.51872877,  232.72187809,
175.38992268, 1561.21886925,  418.55279954,  377.54249682,
254.00507949,  624.95172934,  131.8578389 ,  431.72387945,
176.77028895,  190.59552648,  176.09741626,  185.52527071,
500.45042822,  396.49349148,  349.74688639,  423.80627689,
269.31550222,  250.74098734, 1242.88091003,  367.71530777,
167.41748805,  220.1042115 ,  359.51101529,  197.16000824,
482.71265214,  140.01832597,  223.87047423,  234.20221402,
423.38759553,  433.07591044,  421.20885941,  428.84855574,
198.7574318 ,  824.98677875,  711.01138195,  319.30035389,
425.75983436,  122.55907715,  406.62260083,  179.56228692,
281.35121908])

If we want to see the coefficients in the model itself, access the coef_ attribute:

In [15]:
reg.coef_

Out[15]:
array([3.74491875e+00, 1.45493961e+01, 3.32355174e-01, 4.05670795e+00,
9.00401459e-04])

The fit and predict methods, along with fitted attributes such as coef_, exist for all sklearn estimators whenever they make sense for the given model.

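To see what coef_ means concretely: a linear regression prediction is just the matrix product of the data with the coefficients, plus the fitted intercept (stored in the intercept_ attribute). A sketch on made-up data with an exact linear response:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data: y = 3*x1 - 1*x2 + 5 exactly
x = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]])
y = 3.0 * x[:, 0] - 1.0 * x[:, 1] + 5.0

reg = LinearRegression().fit(x, y)

# reconstruct the predictions by hand from coef_ and intercept_
manual = x @ reg.coef_ + reg.intercept_
```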
### Logistic regression

A very popular variation of linear regression, called logistic regression, exists for classification tasks. The details are beyond our scope today, but let's see how the model works and look at some extra features sklearn provides for classification tasks.

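The key idea, in brief: logistic regression computes the same kind of linear score as linear regression, but passes it through the logistic (sigmoid) function to turn the score into a probability between 0 and 1. A minimal sketch of that function:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# a score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))                    # 0.5
print(sigmoid(np.array([-5.0, 5.0])))  # close to 0 and close to 1
```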
Start by redefining the response variable to be whether a page appears in German:

In [16]:
y_train = lan_version[:325]
y_test  = lan_version[325:]


Next, construct the model:

In [17]:
logreg = sklearn.linear_model.LogisticRegression()
logreg

Out[17]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

And fit it to the data (this works exactly the same as in the regression problem):

In [18]:
logreg.fit(x_train, y_train)

Out[18]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

Now, if we call the predict method it will return predictions that are either 1 (page available in German) or 0 (page not available in German).

In [19]:
logreg.predict(x_train)

Out[19]:
array([1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1])

That is great, and exactly what we often want when doing prediction tasks. In some situations, however, we do not want the predictions themselves but rather an estimate of the probability that an input page has a German version. To get that, call the method predict_proba (here, showing just the first 10 samples for illustration):

In [20]:
logreg.predict_proba(x_train)[:10, :]

Out[20]:
array([[0.03085589, 0.96914411],
[0.40008294, 0.59991706],
[0.65667317, 0.34332683],
[0.40686608, 0.59313392],
[0.02402504, 0.97597496],
[0.0043057 , 0.9956943 ],
[0.44171223, 0.55828777],
[0.44365544, 0.55634456],
[0.52915095, 0.47084905],
[0.32351262, 0.67648738]])

The first column is the probability of observing a 0 and the second is the probability of observing a 1.

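The two outputs are consistent with each other: each row of predict_proba sums to one, and predict returns the class whose column has the higher probability. A small check on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny made-up classification problem
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(x, y)

proba = clf.predict_proba(x)   # columns: P(y=0), P(y=1)
labels = clf.predict(x)        # the more probable class in each row
```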
### Evaluating the model

There are also a number of helpful functions in sklearn for preprocessing our data as well as evaluating the results. Let's produce the predicted classes for the testing set from our logistic regression:

In [21]:
y_test_hat = logreg.predict(x_test)


There are two metrics that I spend most of my time looking at. The first is the confusion matrix (which shows how many items are misclassified):

In [22]:
sklearn.metrics.confusion_matrix(y_test, y_test_hat)

Out[22]:
array([[ 71,  49],
[ 19, 270]])

As well as the accuracy score:

In [23]:
sklearn.metrics.accuracy_score(y_test, y_test_hat)

Out[23]:
0.8337408312958435

Together, these give a good picture of how well the model performs.

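The two metrics are directly related: accuracy is the sum of the confusion matrix's diagonal (the correctly classified items) divided by the total number of predictions. A quick check with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up true and predicted labels
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_hat)

# diagonal entries are the correct predictions
acc = np.trace(cm) / cm.sum()
```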
### Try another model

Now, for some practice. Look through the list of available models in sklearn.

Pick something that seems interesting, and use this model to make predictions on the testing set. Then, compute the confusion matrix and accuracy.

Does your model outperform the logistic example?

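As one possible answer (using made-up data rather than the wiki corpus, and k-nearest neighbors as an arbitrary choice of model), the exact same construct/fit/predict/evaluate steps apply:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up data standing in for the wiki predictor matrix
rng = np.random.RandomState(0)
x = rng.rand(100, 5)
y = (x[:, 0] + x[:, 1] > 1.0).astype(int)

x_train, x_test = x[:60], x[60:]
y_train, y_test = y[:60], y[60:]

# same pattern as before: construct, fit, predict
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_test_hat = knn.predict(x_test)

cm = confusion_matrix(y_test, y_test_hat)
acc = accuracy_score(y_test, y_test_hat)
```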
For reference, the glmnet module can be installed from conda-forge and then imported like this:

conda install -c conda-forge glmnet
import glmnet