{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tutorial 23: Predicting Internal Links\n",
"\n",
"In this tutorial, we are going to build some predictive models from \"scratch\".\n",
"That is, without the help of a Python library specifically build for prediction.\n",
"Such libraries will be the topic of the next few weeks of the course.\n",
"\n",
"We are going to, as usual, build our models using data from Wikipedia. You will\n",
"grab links from a specific starting page to build a small corpus of items. Then,\n",
"from these pages, you will consider two prediction tasks:\n",
"\n",
"- how many internal links does the page have? (regression)\n",
"- is the page translated into German (classification)\n",
"\n",
"I will help you process the input data. Your task is to build the predictive \n",
"models and assess how well the perform on the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the libraries\n",
"\n",
"We will make use of the three class modules, as well as **numpy**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import wiki\n",
"import iplot\n",
"import wikitext\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert wiki.__version__ >= 6\n",
"assert wikitext.__version__ >= 2\n",
"assert iplot.__version__ >= 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get some data\n",
"\n",
"For today, make sure that you download the following data. Uncomment the line,\n",
"run it, and then recomment it so that it only runs once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#wiki.bulk_download('philosophy', force=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For today, we will take all of the pages referenced on the \"important publications\n",
"in philosophy\" page to build a corpus for prediction. We will make a `WikiCorpus`\n",
"object to simplify the computation of metrics for the page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(0)\n",
"links = wikitext.get_internal_links('List_of_important_publications_in_philosophy')['ilinks']\n",
"links = np.random.permutation(links)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building training and testing data\n",
"\n",
"Let's now extract the number of internal links on each page, whether the page\n",
"is translated into German ('de'), and five predictor variables that we will try\n",
"to use in constructing our models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_ilinks = wcorp.meta['num_ilinks'].values\n",
"lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=np.int)\n",
"\n",
"num_sections = wcorp.meta['num_sections'].values\n",
"num_images = wcorp.meta['num_images'].values\n",
"num_elinks = wcorp.meta['num_elinks'].values\n",
"num_langs = wcorp.meta['num_langs'].values\n",
"num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we will consider the first 325 observations the \"training\" set and the \n",
"rest of the data (about the same amount) as the \"testing\" set. To split these\n",
"out, we could use something like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train = num_ilinks[:325]\n",
"y_test = num_ilinks[325:]\n",
"x_train = num_images[:325]\n",
"x_test = num_images[325:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need to redefine these depending on the classification task."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model I: Linear Regression\n",
"\n",
"The first model we will try to build is a linear regression. This is an easy\n",
"place to start because you have already seen regression models, however now\n",
"we won't be using the *statsmodels* module to help us construct the output.\n",
"Recall that a simple linear regression predicts the response $y$ from a variable\n",
"$x$ accoring to:\n",
"\n",
"$$ y = a + b \\cdot x $$\n",
"\n",
"For some parameters $a$ and $b$. We want to find values for these two parameters\n",
"such that the sum of absolute values is minimized:\n",
"\n",
"$$ \\sum_i | y_i - (a + b \\cdot x_i) | $$\n",
"\n",
"We could optimize this using some fancy algorithm. Today we will start by\n",
"using trial-and-error on various values of $a$ and $b$.\n",
"\n",
"Start by constructing a function `linreg_model_eval` that takes inputs $x$, $y$,\n",
"$a$, and $b$ and returns the sum of absolute errors using these as the intercept\n",
"and slope terms to predict $y$ from $x$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def linreg_model_eval(x, y, a, b):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, build a function called `linreg_model_best_params` that takes as input $x$ and $y$\n",
"and returns the best guess for the slope and intercept. You can use the function\n",
"`np.linspace(0, 10, num=100)` to generate a range of guesses for the slope and intercept\n",
"(this would find 100 numbers between 0 and 1). I suggest letting the intercept range from\n",
"-200 to 200 and the slope from -100 to 100 with about 200-1000 guesses in each. If you\n",
"find that a result gives an extreme value, try adjusting the range. Remember that you can\n",
"return two values by seperating them with a comma (i.e., `return a, b`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def linreg_model_best_params(x, y):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are stuck on the above task, here's a sketch as to how to do this. You\n",
"need to start with a double for loop cycling over all the values and for each returning\n",
"how good the fit is. That should be relatively straightforward. From there, I suggest\n",
"storing a value `lowest_abs` that keeps track of the best prediction value so far and\n",
"using an if statement to update the final a and b when this is improved on by a particular\n",
"model.\n",
"\n",
"Now, apply your regression model using the five predictor variables. Print\n",
"out how well the model performs on the training and testing set (I'll leave it\n",
"to you to figure how how to write this code)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Describe the results about which metric seems the most predictive. Do the results\n",
"tell us about how Wikipedia represents knowledge?\n",
"\n",
"**Answer**:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model II: Best Split Regression\n",
"\n",
"Our second model mimics the one that I built in Tutorial 22, but with a bit more\n",
"room for generalization. Here, we will have three parameters: alpha, a, and b.\n",
"And the model will be given by:\n",
"\n",
"$$ \\widehat{y} = \\begin{cases} a, & x \\leq \\alpha \\\\ b, & caps > \\alpha \\end{cases} $$\n",
"\n",
"So, we have not hard coded what the two values are on either side of the split. We will\n",
"need slightly different functions for the classification and regression tasks. Let's\n",
"start with regression (I think it should be a bit easier to write). \n",
"\n",
"Write a function `splitreg_model_eval` that returns the sum of absolute errors for the\n",
"model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def splitreg_model_eval(x, y, alpha, a, b):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, as before, we want a function `splitreg_model_best_params` that takes just\n",
"the data and returns the best parameters. Here there will be three outputs. To\n",
"do this, use `np.linspace` again to cycle over values for `alpha`, but you\n",
"should set $a$ and $b$ equal to the median of the data split by the cut off `alpha`.\n",
"Something like this should work: `np.median(x[x > alpha])`. Also, you should let\n",
"`alpha` range from `np.min(x)` to `np.max(x)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def splitreg_model_best_params(x, y):\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, apply this to each predictor variable and determine which is the best (again, I leave\n",
"the method for doing this up to you)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which is most predictive? Is this different than your results above? Which estimator\n",
"is a better predictor and why do you think this?\n",
"\n",
"**Answer**:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model III: Best Split Classification\n",
"\n",
"Using a best split value for classification is similar to the regression case; it\n",
"just uses a different evaluation function that returns the percentage of pages \n",
"correctly classified. Write your evaluation function for this here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def splitclass_model_eval(x, y, alpha, a, b):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, define a function that returns the best parameters. This should be exactly\n",
"the same as `splitreg_model_best_params` with the new evaluation function switched out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def splitclass_model_best_params(x, y):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test this function using the five predictor variables to predict the variable\n",
"`lan_version`, whether the page has a German language version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which variable seems the most predictive? Do you have any intuition for why this variable\n",
"is the most predictive?\n",
"\n",
"**Answer**:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### More Practice\n",
"\n",
"If you manage to finish the above tasks before the end of the class, wrap up one\n",
"more function that takes two inputs: `wcorp` and a language code. It should print\n",
"out (it doesn't need to return anything) information about each of the best splits\n",
"for predicting the language codes for each of the predictor variables. Note that you\n",
"may want to make use of the `format` method I used in Tutorial 22. Try running the function\n",
"for several languages, and perhaps even a different starting page. See if you start to\n",
"notice anything interesting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def predict_lang(wcorp, lang=\"de\"):\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}