In this tutorial, we are going to build some predictive models from "scratch". That is, without the help of a Python library specifically build for prediction. Such libraries will be the topic of the next few weeks of the course.

We are going to, as usual, build our models using data from Wikipedia. You will grab links from a specific starting page to build a small corpus of items. Then, from these pages, you will consider two prediction tasks:

- how many internal links does the page have? (regression)
- is the page translated into German (classification)

I will help you process the input data. Your task is to build the predictive models and assess how well the perform on the dataset.

We will make use of the three class modules, as well as **numpy**:

In [1]:

```
import wiki
import iplot
import wikitext
import numpy as np
```

In [2]:

```
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3
```

For today, make sure that you download the following data. Uncomment the line, run it, and then recomment it so that it only runs once.

In [3]:

```
#wiki.bulk_download('philosophy', force=True)
```

For today, we will take all of the pages referenced on the "important publications
in philosophy" page to build a corpus for prediction. We will make a `WikiCorpus`

object to simplify the computation of metrics for the page.

In [4]:

```
np.random.seed(0)
links = wikitext.get_internal_links('List_of_important_publications_in_philosophy')['ilinks']
links.remove("What_Is_it_Like_to_Be_a_Bat?")
links.remove("What_is_Life?_(SchrÃ¶dinger)")
links = np.random.permutation(links)
```

In [5]:

```
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)
```

Let's now extract the number of internal links on each page, whether the page is translated into German ('de'), and five predictor variables that we will try to use in constructing our models.

In [6]:

```
num_ilinks = wcorp.meta['num_ilinks'].values
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=np.int)
num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])
```

Now, we will consider the first 325 observations the "training" set and the rest of the data (about the same amount) as the "testing" set. To split these out, we could use something like this:

In [7]:

```
y_train = num_ilinks[:325]
y_test = num_ilinks[325:]
x_train = num_images[:325]
x_test = num_images[325:]
```

You'll need to redefine these depending on the classification task.

The first model we will try to build is a linear regression. This is an easy
place to start because you have already seen regression models, however now
we won't be using the *statsmodels* module to help us construct the output.
Recall that a simple linear regression predicts the response $y$ from a variable
$x$ accoring to:

$$ y = a + b \cdot x $$

For some parameters $a$ and $b$. We want to find values for these two parameters such that the sum of absolute values is minimized:

$$ \sum_i | y_i - (a + b \cdot x_i) | $$

We could optimize this using some fancy algorithm. Today we will start by using trial-and-error on various values of $a$ and $b$.

Start by constructing a function `linreg_model_eval`

that takes inputs $x$, $y$,
$a$, and $b$ and returns the sum of absolute errors using these as the intercept
and slope terms to predict $y$ from $x$.

In [8]:

```
def linreg_model_eval(x, y, a, b):
return np.sum(np.abs(y - (a + b* x)))
```

Now, build a function called `linreg_model_best_params`

that takes as input $x$ and $y$
and returns the best guess for the slope and intercept. You can use the function
`np.linspace(0, 10, num=100)`

to generate a range of guesses for the slope and intercept
(this would find 100 numbers between 0 and 1). I suggest letting the intercept range from
-200 to 200 and the slope from -100 to 100 with about 200-1000 guesses in each. If you
find that a result gives an extreme value, try adjusting the range. Remember that you can
return two values by seperating them with a comma (i.e., `return a, b`

).

In [9]:

```
def linreg_model_best_params(x, y):
best_params = dict(error = float("inf"))
for a in np.linspace(-5000, 200, num=1000):
for b in np.linspace(-200, 200, num=1000):
error = linreg_model_eval(x, y, a, b)
if error < best_params['error']:
best_params['error'] = error
best_params['a'] = a
best_params['b'] = b
return best_params['a'], best_params['b']
```

If you are stuck on the above task, here's a sketch as to how to do this. You
need to start with a double for loop cycling over all the values and for each returning
how good the fit is. That should be relatively straightforward. From there, I suggest
storing a value `lowest_abs`

that keeps track of the best prediction value so far and
using an if statement to update the final a and b when this is improved on by a particular
model.

Now, apply your regression model using the five predictor variables. Print out how well the model performs on the training and testing set (I'll leave it to you to figure how how to write this code).

In [10]:

```
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
num_langs=num_langs, num_chars=num_chars)
for key, val in x_vars.items():
a, b = linreg_model_best_params(val[:325], y_train)
err = linreg_model_eval(val[325:], y_test, a, b)
print("{0:12s}: a = {1: 10.02f}, b ={2: 7.01f}, err ={3: 8.0f}".format(key, a, b, err))
```

Describe the results about which metric seems the most predictive. Do the results tell us about how Wikipedia represents knowledge?

**Answer**: The most predictive variable is the number of images.

Our second model mimics the one that I built in Tutorial 22, but with a bit more room for generalization. Here, we will have three parameters: alpha, a, and b. And the model will be given by:

$$ \widehat{y} = \begin{cases} a, & x \leq \alpha \\ b, & caps > \alpha \end{cases} $$

So, we have not hard coded what the two values are on either side of the split. We will need slightly different functions for the classification and regression tasks. Let's start with regression (I think it should be a bit easier to write).

Write a function `splitreg_model_eval`

that returns the sum of absolute errors for the
model.

In [11]:

```
def splitreg_model_eval(x, y, alpha, a, b):
y_hat = np.ones_like(y) * a
y_hat[x > alpha] = b
return np.sum(np.abs(y - y_hat))
```

Now, as before, we want a function `splitreg_model_best_params`

that takes just
the data and returns the best parameters. Here there will be three outputs. To
do this, use `np.linspace`

again to cycle over values for `alpha`

, but you
should set $a$ and $b$ equal to the median of the data split by the cut off `alpha`

.
Something like this should work: `np.median(x[x > alpha])`

. Also, you should let
`alpha`

range from `np.min(x)`

to `np.max(x)`

.

In [12]:

```
def splitreg_model_best_params(x, y):
best_params = dict(error = float("inf"))
for alpha in np.linspace(np.min(x), np.max(x), num=100, endpoint=False)[1:]:
a = np.median(y[x <= alpha])
b = np.median(y[x > alpha])
error = splitreg_model_eval(x, y, alpha, a, b)
if error < best_params['error']:
best_params['error'] = error
best_params['alpha'] = alpha
best_params['a'] = a
best_params['b'] = b
return best_params['alpha'], best_params['a'], best_params['b']
```

Now, apply this to each predictor variable and determine which is the best (again, I leave the method for doing this up to you).

In [13]:

```
y_train = num_ilinks[:325]
y_test = num_ilinks[325:]
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
num_langs=num_langs, num_chars=num_chars)
for key, val in x_vars.items():
alpha, a, b = splitreg_model_best_params(val[:325], y_train)
err = splitreg_model_eval(val[325:], y_test, alpha, a, b)
print("{0:12s}: alpha = {1: 8.02f} a ={2: 3.02f}, b ={3: 7.01f}, err ={4: 7.0f}".format(key, alpha, a, b, err))
```

Which is most predictive? Is this different than your results above? Which estimator is a better predictor and why do you think this?

**Answer**:

Using a best split value for classification is similar to the regression case; it just uses a different evaluation function that returns the percentage of pages correctly classified. Write your evaluation function for this here:

In [14]:

```
def splitclass_model_eval(x, y, alpha, a, b):
y_hat = np.ones_like(y) * a
y_hat[x > alpha] = b
return np.mean(y != y_hat)
```

Then, define a function that returns the best parameters. This should be exactly
the same as `splitreg_model_best_params`

with the new evaluation function switched out.

In [15]:

```
def splitclass_model_best_params(x, y):
best_params = dict(error = float("inf"))
for alpha in np.linspace(np.min(x)+0.1, np.max(x)-0.1, num=100, endpoint=False):
a = np.median(y[x <= alpha])
b = np.median(y[x > alpha])
error = splitclass_model_eval(x, y, alpha, a, b)
if error < best_params['error']:
best_params['error'] = error
best_params['alpha'] = alpha
best_params['a'] = a
best_params['b'] = b
return best_params['alpha'], best_params['a'], best_params['b']
```

Test this function using the five predictor variables to predict the variable
`lan_version`

, whether the page has a German language version.

In [16]:

```
y_train = lan_version[:325]
y_test = lan_version[325:]
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
num_langs=num_langs, num_chars=num_chars)
for key, val in x_vars.items():
alpha, a, b = splitclass_model_best_params(val[:325], y_train)
err = splitclass_model_eval(val[325:], y_test, alpha, a, b)
print("{0:12s}: alpha ={1: 10.02f} a ={2: 3.02f}, b ={3: 3.01f}, err ={4: 0.04f}".format(key, alpha, a, b, err))
```

If you manage to finish the above tasks before the end of the class, wrap up one
more function that takes two inputs: `wcorp`

and a language code. It should print
out (it doesn't need to return anything) information about each of the best splits
for predicting the language codes for each of the predictor variables. Note that you
may want to make use of the `format`

method I used in Tutorial 22. Try running the function
for several languages, and perhaps even a different starting page. See if you start to
notice anything interesting.

In [17]:

```
def predict_lang(wcorp, lang="de"):
lan_version = np.array([lang in x for x in wcorp.meta['langs']], dtype=np.int)
num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])
y_train = lan_version[:325]
y_test = lan_version[325:]
x_vars = dict(num_sections=num_sections, num_images=num_images, num_elinks=num_elinks,
num_langs=num_langs, num_chars=num_chars)
for key, val in x_vars.items():
alpha, a, b = splitclass_model_best_params(val[:325], y_train)
err = splitclass_model_eval(val[325:], y_test, alpha, a, b)
print("{0:12s}: alpha ={1: 10.02f} a ={2: 3.02f}, b ={3: 3.01f}, err ={4: 0.04f}".format(key, alpha, a, b, err))
```

In [18]:

```
predict_lang(wcorp, "uk")
```