## Tutorial 22: Predictive Models¶

For the next 2.5 weeks, we will be looking at how to build predictive models, with a focus on text and image processing. It will be a nice self-contained introduction, but also a good introduction to my MATH389: Introduction to Statistical Learning course that is being offered this Spring.

### What Is Predictive Modeling / "Learning"?¶

Statistical learning, synonymous with machine learning, is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data.

A classical example is a spam filter, for which a user labels incoming mails as either spam or not spam. A machine learning algorithm then "learns" a predictive model from data that distinguishes spam from normal emails, a model which can predict for new emails whether they are spam or not.

Here are some explicit examples:

• using physical characteristics of animals to predict whether they are carnivores
• estimate how much a house is worth given properties such as number of bedrooms, square footage and its address
• predict whether a flight will be delayed given the carrier, scheduled departure time, arrival and departure airports
• a crime has been reported at a specific place and time in Chicago; what type of crime is it?
• here is a picture of a flower, what kind of flower is it?
• given two sentences of text, predict which President used it in a public speech

And can look at some explicit examples of these models in the wild:

### Learning Algorithms¶

In my experience, most learning algorithms fall into one of two broad categories:

• nearest neighbors (local): estimate values of new points by finding previously observed points close to the new ones
• linear models (global): estimate weights for each parameter; classify new points by summing up these weights

Within these classes, I typically find the need to use only some combination of the following four algorithms:

1. k-nearest neighbors: a straightforward application of nearest neighbors
2. gradient boosted trees: adaptively implement nearest neighbors by determining which directions "matter"
3. elastic net: a linear model with controls on the sizes of the weights
4. neural networks: iteratively apply collections of elastic nets to learn a hierarchy of increasingly complex weights

If some of these concepts seem hazy at the moment, that is perfectly natural. We'll go into much more detail throughout the next few weeks.

### Our Approach¶

Today I want you to be become familiar with the basic mechanics of building predictive models. Here is my rough schedule for the next few classes:

• 2018-10-18 (today): mechanics of predictive models
• 2018-10-23: Using matricies in predictive models; introduction to sklearn
• 2018-10-25: unsupervised learning and dimensionality reduction
• 2018-10-30: Neural networks I; introduction to keras
• 2018-11-01: Neural networks II; application to Wikipedia

That should take us to working on your final project for the semester.

### A simple model and example¶

Let's consider an example here where we have two variables, the number of capital letters in a text message an a classification of whether the message is spam (1) or not (0).

In [1]:
caps = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 4, 5, 5, 6, 6, 8, 8]
spam = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]


And let's consider a model that simply classifies a message as being spam based on a threshold for how many capital letters it contains:

$$f(caps) = \begin{cases} 0, & caps \leq \alpha \\ 1, & caps > \alpha \end{cases}$$

For some threshold $\alpha$. How might we select the best value of $\alpha$? One approach would be to test a bunch of values and determine what the best value is.

So, consider all of the sensible values:

In [2]:
alpha_vals = list(range(0, 9))
alpha_vals

Out[2]:
[0, 1, 2, 3, 4, 5, 6, 7, 8]

And then we can test for each value how many predictions are correct:

In [3]:
for alpha in alpha_vals:
error = []
for x, y in zip(caps, spam):
error.append(int(int(x > alpha) == y))
print("alpha={0:d}  percent correct={1:f}".format(alpha, sum(error) / len(error)))

alpha=0  percent correct=0.555556
alpha=1  percent correct=0.722222
alpha=2  percent correct=0.666667
alpha=3  percent correct=0.666667
alpha=4  percent correct=0.611111
alpha=5  percent correct=0.611111
alpha=6  percent correct=0.611111
alpha=7  percent correct=0.611111
alpha=8  percent correct=0.500000


So, the best choice appears to by predicting anything with more than 1 capital letter equal to "spam" and otherwise assuming the message is not spam. It appears that we will be correct 72.2% of the time.

## Training and Testing¶

There is one way we are cheating in this simple example. We are using the same dataset to pick the value of $\alpha$ that we are using to validate that it is a good choice. This doesn't matter very much here, but becomes an issue when comparing models of different complexities. It will always seem — if we use the approach here — that a more complex model is better.

A common method to avoid this issue is to split the dataset into two groups:

1. A training set, which is used to select the parameters of the model (e.g., the $\alpha$ above).
2. A testing set, which is use to test how well a model performs on new data.

In some cases you may create these datasets yourself; in others, they may be given to you by an external source.

### Classification vs Regression¶

In the spam example, the thing we were trying to predict was a categorical variable. It has just two values, 'spam' and 'not spam', and we want to guess which of these two states a new message belongs to. There are also other prediction tasks that involve predicting a continuous variable. For example, predicting the price of a house. The first type of problem is called classsification and the second are often called regression. The biggest difference between these for us will be how we measure how good a model is at predicting the response. For classification, just computing the percentage of correct guesses is a good start. For regression typically we use absolute error:

$$| y - \widehat{y} |$$

Or squared error:

$$| y - \widehat{y} |^2$$

Different choices lead to different models, something that we discuss in depth in MATH389.

### Numpy¶

One Python library that will be very important useful for us in predictive models is numpy. I've already used this a bit in the python files created for the class, but we haven't seen them directly yet. I'll explain more on Tuesday, but here is a very minimal introduction.

Typically, numpy is imported as the abbrevition np:

In [4]:
import numpy as np


The key function for us here is called np.array. It converts a Python list into a numpy array. At first it will not appear that much has happened:

In [5]:
caps = np.array(caps)
caps

Out[5]:
array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 4, 5, 5, 6, 6, 8, 8])
In [6]:
spam = np.array(spam)
spam

Out[6]:
array([0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1])

But, the big difference is that we can now perform vector arithmetic on the data. That is, we can operate on the object as a whole without needing to write a bunch of for loops:

In [7]:
caps + 1

Out[7]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 5, 6, 6, 7, 7, 9, 9])

And:

In [8]:
spam / (caps + 1)

Out[8]:
array([0.        , 0.        , 1.        , 0.        , 0.        ,
0.        , 0.        , 0.33333333, 0.        , 0.33333333,
0.33333333, 0.2       , 0.        , 0.16666667, 0.14285714,
0.        , 0.11111111, 0.11111111])

There are also a bunch of numeric functions that operate on an entire vector, such as np.mean, np.log, and np.round:

In [9]:
np.mean(spam)

Out[9]:
0.5

And, here is how we would re-write the code to check for the alpha values:

In [10]:
for alpha in alpha_vals:
error = np.mean((caps > alpha) == spam)
print("alpha={0:d}  percent correct={1:f}".format(alpha, error))

alpha=0  percent correct=0.555556
alpha=1  percent correct=0.722222
alpha=2  percent correct=0.666667
alpha=3  percent correct=0.666667
alpha=4  percent correct=0.611111
alpha=5  percent correct=0.611111
alpha=6  percent correct=0.611111
alpha=7  percent correct=0.611111
alpha=8  percent correct=0.500000


On the whole, this is quite a bit easier to write and read. This will be even more important, as we will see, when building more complex models.