In this tutorial, we are going to build some predictive models from "scratch". That is, without the help of a Python library specifically build for prediction. Such libraries will be the topic of the next few weeks of the course.

We are going to, as usual, build our models using data from Wikipedia. You will grab links from a specific starting page to build a small corpus of items. Then, from these pages, you will consider two prediction tasks:

  • how many internal links does the page have? (regression)
  • is the page translated into German (classification)

I will help you process the input data. Your task is to build the predictive models and assess how well the perform on the dataset.

Loading the libraries

We will make use of the three class modules, as well as numpy:

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
Loading BokehJS ...
In [2]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3

Get some data

For today, make sure that you download the following data. Uncomment the line, run it, and then recomment it so that it only runs once.

In [3]:
#wiki.bulk_download('philosophy', force=True)

For today, we will take all of the pages referenced on the "important publications in philosophy" page to build a corpus for prediction. We will make a WikiCorpus object to simplify the computation of metrics for the page.

In [4]:
np.random.seed(0)
links = wikitext.get_internal_links('List_of_important_publications_in_philosophy')['ilinks']
links.remove("What_Is_it_Like_to_Be_a_Bat?")
links.remove("What_is_Life?_(Schrödinger)")
links = np.random.permutation(links)
In [5]:
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)

Building training and testing data

Let's now extract the number of internal links on each page, whether the page is translated into German ('de'), and five predictor variables that we will try to use in constructing our models.

In [6]:
num_ilinks = wcorp.meta['num_ilinks'].values
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=np.int)

num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])

Now, we will consider the first 325 observations the "training" set and the rest of the data (about the same amount) as the "testing" set. To split these out, we could use something like this:

In [7]:
y_train = num_ilinks[:325]
y_test  = num_ilinks[325:]
x_train = num_images[:325]
x_test  = num_images[325:]

You'll need to redefine these depending on the classification task.

Model I: Linear Regression

The first model we will try to build is a linear regression. This is an easy place to start because you have already seen regression models, however now we won't be using the statsmodels module to help us construct the output. Recall that a simple linear regression predicts the response $y$ from a variable $x$ accoring to:

$$ y = a + b \cdot x $$

For some parameters $a$ and $b$. We want to find values for these two parameters such that the sum of absolute values is minimized:

$$ \sum_i | y_i - (a + b \cdot x_i) | $$

We could optimize this using some fancy algorithm. Today we will start by using trial-and-error on various values of $a$ and $b$.

Start by constructing a function linreg_model_eval that takes inputs $x$, $y$, $a$, and $b$ and returns the sum of absolute errors using these as the intercept and slope terms to predict $y$ from $x$.

In [8]:
def linreg_model_eval(x, y, a, b):
    return np.sum(np.abs(y - (a + b* x)))

Now, build a function called linreg_model_best_params that takes as input $x$ and $y$ and returns the best guess for the slope and intercept. You can use the function np.linspace(0, 10, num=100) to generate a range of guesses for the slope and intercept (this would find 100 numbers between 0 and 1). I suggest letting the intercept range from -200 to 200 and the slope from -100 to 100 with about 200-1000 guesses in each. If you find that a result gives an extreme value, try adjusting the range. Remember that you can return two values by seperating them with a comma (i.e., return a, b).

In [9]:
def linreg_model_best_params(x, y):
    best_params = dict(error = float("inf"))
    for a in np.linspace(-5000, 200, num=1000):
        for b in np.linspace(-200, 200, num=1000):
            error = linreg_model_eval(x, y, a, b)
            if error < best_params['error']:
                best_params['error'] = error
                best_params['a'] = a
                best_params['b'] = b
    
    return best_params['a'], best_params['b']

If you are stuck on the above task, here's a sketch as to how to do this. You need to start with a double for loop cycling over all the values and for each returning how good the fit is. That should be relatively straightforward. From there, I suggest storing a value lowest_abs that keeps track of the best prediction value so far and using an if statement to update the final a and b when this is improved on by a particular model.

Now, apply your regression model using the five predictor variables. Print out how well the model performs on the training and testing set (I'll leave it to you to figure how how to write this code).

In [10]:
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
              num_langs=num_langs, num_chars=num_chars)

for key, val in x_vars.items():
    
    a, b = linreg_model_best_params(val[:325], y_train)
    err = linreg_model_eval(val[325:], y_test, a, b)
    
    print("{0:12s}:  a = {1: 10.02f}, b ={2: 7.01f}, err ={3: 8.0f}".format(key, a, b, err))
num_sections:  a =      38.64, b =   19.0, err =   94047
num_images  :  a =      75.08, b =   37.8, err =   79263
num_elinks  :  a =     132.33, b =    7.0, err =   94168
num_langs   :  a =      95.90, b =    9.4, err =   86372
num_chars   :  a =   -2121.52, b =    0.2, err = 1058439

Describe the results about which metric seems the most predictive. Do the results tell us about how Wikipedia represents knowledge?

Answer: The most predictive variable is the number of images.

Model II: Best Split Regression

Our second model mimics the one that I built in Tutorial 22, but with a bit more room for generalization. Here, we will have three parameters: alpha, a, and b. And the model will be given by:

$$ \widehat{y} = \begin{cases} a, & x \leq \alpha \\ b, & caps > \alpha \end{cases} $$

So, we have not hard coded what the two values are on either side of the split. We will need slightly different functions for the classification and regression tasks. Let's start with regression (I think it should be a bit easier to write).

Write a function splitreg_model_eval that returns the sum of absolute errors for the model.

In [11]:
def splitreg_model_eval(x, y, alpha, a, b):
    y_hat = np.ones_like(y) * a
    y_hat[x > alpha] = b
    
    return np.sum(np.abs(y - y_hat))

Now, as before, we want a function splitreg_model_best_params that takes just the data and returns the best parameters. Here there will be three outputs. To do this, use np.linspace again to cycle over values for alpha, but you should set $a$ and $b$ equal to the median of the data split by the cut off alpha. Something like this should work: np.median(x[x > alpha]). Also, you should let alpha range from np.min(x) to np.max(x).

In [12]:
def splitreg_model_best_params(x, y):
    best_params = dict(error = float("inf"))
    for alpha in np.linspace(np.min(x), np.max(x), num=100, endpoint=False)[1:]:
        a = np.median(y[x <= alpha])
        b = np.median(y[x > alpha])
        error = splitreg_model_eval(x, y, alpha, a, b)
        
        if error < best_params['error']:
            best_params['error'] = error
            best_params['alpha'] = alpha
            best_params['a'] = a
            best_params['b'] = b
    
    return best_params['alpha'], best_params['a'], best_params['b']

Now, apply this to each predictor variable and determine which is the best (again, I leave the method for doing this up to you).

In [13]:
y_train = num_ilinks[:325]
y_test  = num_ilinks[325:]
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
              num_langs=num_langs, num_chars=num_chars)

for key, val in x_vars.items():
    
    alpha, a, b = splitreg_model_best_params(val[:325], y_train)
    err = splitreg_model_eval(val[325:], y_test, alpha, a, b)
    
    print("{0:12s}:  alpha = {1: 8.02f} a ={2: 3.02f}, b ={3: 7.01f}, err ={4: 7.0f}".format(key, alpha, a, b, err))
num_sections:  alpha =    18.96 a = 189.00, b =  678.0, err =  99621
num_images  :  alpha =     4.24 a = 97.00, b =  508.5, err =  92836
num_elinks  :  alpha =    34.76 a = 192.00, b =  672.0, err =  94227
num_langs   :  alpha =    33.48 a = 193.00, b =  711.0, err =  90007
num_chars   :  alpha =  22653.30 a = 232.00, b =  781.5, err =  98818

Which is most predictive? Is this different than your results above? Which estimator is a better predictor and why do you think this?

Answer:

Model III: Best Split Classification

Using a best split value for classification is similar to the regression case; it just uses a different evaluation function that returns the percentage of pages correctly classified. Write your evaluation function for this here:

In [14]:
def splitclass_model_eval(x, y, alpha, a, b):
    y_hat = np.ones_like(y) * a
    y_hat[x > alpha] = b
    
    return np.mean(y != y_hat)

Then, define a function that returns the best parameters. This should be exactly the same as splitreg_model_best_params with the new evaluation function switched out.

In [15]:
def splitclass_model_best_params(x, y):
    best_params = dict(error = float("inf"))
    for alpha in np.linspace(np.min(x)+0.1, np.max(x)-0.1, num=100, endpoint=False):
        a = np.median(y[x <= alpha])
        b = np.median(y[x > alpha])
        error = splitclass_model_eval(x, y, alpha, a, b)
        
        if error < best_params['error']:
            best_params['error'] = error
            best_params['alpha'] = alpha
            best_params['a'] = a
            best_params['b'] = b
    
    return best_params['alpha'], best_params['a'], best_params['b']

Test this function using the five predictor variables to predict the variable lan_version, whether the page has a German language version.

In [16]:
y_train = lan_version[:325]
y_test  = lan_version[325:]
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
              num_langs=num_langs, num_chars=num_chars)

for key, val in x_vars.items():
    
    alpha, a, b = splitclass_model_best_params(val[:325], y_train)
    err = splitclass_model_eval(val[325:], y_test, alpha, a, b)
    
    print("{0:12s}:  alpha ={1: 10.02f} a ={2: 3.02f}, b ={3: 3.01f}, err ={4: 0.04f}".format(key, alpha, a, b, err))
num_sections:  alpha =      7.99 a = 0.00, b = 1.0, err = 0.2972
num_images  :  alpha =      2.21 a = 0.00, b = 1.0, err = 0.2642
num_elinks  :  alpha =     12.73 a = 0.00, b = 1.0, err = 0.2712
num_langs   :  alpha =     11.25 a = 0.00, b = 1.0, err = 0.1792
num_chars   :  alpha =   4062.79 a = 0.00, b = 1.0, err = 0.2547

More Practice

If you manage to finish the above tasks before the end of the class, wrap up one more function that takes two inputs: wcorp and a language code. It should print out (it doesn't need to return anything) information about each of the best splits for predicting the language codes for each of the predictor variables. Note that you may want to make use of the format method I used in Tutorial 22. Try running the function for several languages, and perhaps even a different starting page. See if you start to notice anything interesting.

In [17]:
def predict_lang(wcorp, lang="de"):
    lan_version = np.array([lang in x for x in wcorp.meta['langs']], dtype=np.int)

    num_sections = wcorp.meta['num_sections'].values
    num_images = wcorp.meta['num_images'].values
    num_elinks = wcorp.meta['num_elinks'].values
    num_langs = wcorp.meta['num_langs'].values
    num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])
    
    y_train = lan_version[:325]
    y_test  = lan_version[325:]
    x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
                  num_langs=num_langs, num_chars=num_chars)

    for key, val in x_vars.items():

        alpha, a, b = splitclass_model_best_params(val[:325], y_train)
        err = splitclass_model_eval(val[325:], y_test, alpha, a, b)

        print("{0:12s}:  alpha ={1: 10.02f} a ={2: 3.02f}, b ={3: 3.01f}, err ={4: 0.04f}".format(key, alpha, a, b, err))
In [18]:
predict_lang(wcorp, "uk")
num_sections:  alpha =     11.15 a = 0.00, b = 1.0, err = 0.2854
num_images  :  alpha =      5.38 a = 0.00, b = 1.0, err = 0.2406
num_elinks  :  alpha =     34.84 a = 0.00, b = 1.0, err = 0.2877
num_langs   :  alpha =     20.54 a = 0.00, b = 1.0, err = 0.1297
num_chars   :  alpha =  18669.67 a = 0.00, b = 1.0, err = 0.3066