Tutorial 22: Predictive Models

For the next 2.5 weeks, we will be looking at how to build predictive models, with a focus on text and image processing. It will be a nice self-contained introduction, but also a good introduction to my MATH389: Introduction to Statistical Learning course that is being offered this Spring.

What Is Predictive Modeling / "Learning"?

Statistical learning, synonymous with machine learning, is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data.

A classical example is a spam filter, for which a user labels incoming mails as either spam or not spam. A machine learning algorithm then "learns" a predictive model from data that distinguishes spam from normal emails, a model which can predict for new emails whether they are spam or not.

Here are some explicit examples:

  • using physical characteristics of animals to predict whether they are carnivores
  • estimate how much a house is worth given properties such as number of bedrooms, square footage and its address
  • predict whether a flight will be delayed given the carrier, scheduled departure time, arrival and departure airports
  • a crime has been reported at a specific place and time in Chicago; what type of crime is it?
  • here is a picture of a flower, what kind of flower is it?
  • given two sentences of text, predict which President used it in a public speech
  • how many page views will a specific Wikipedia page receive tomorrow?

And can look at some explicit examples of these models in the wild:

Learning Algorithms

In my experience, most learning algorithms fall into one of two broad categories:

  • nearest neighbors (local): estimate values of new points by finding previously observed points close to the new ones
  • linear models (global): estimate weights for each parameter; classify new points by summing up these weights

Within these classes, I typically find the need to use only some combination of the following four algorithms:

  1. k-nearest neighbors: a straightforward application of nearest neighbors
  2. gradient boosted trees: adaptively implement nearest neighbors by determining which directions "matter"
  3. elastic net: a linear model with controls on the sizes of the weights
  4. neural networks: iteratively apply collections of elastic nets to learn a hierarchy of increasingly complex weights

If some of these concepts seem hazy at the moment, that is perfectly natural. We'll go into much more detail throughout the next few weeks.

Our Approach

Today I want you to be become familiar with the basic mechanics of building predictive models. Here is my rough schedule for the next few classes:

  • 2018-10-18 (today): mechanics of predictive models
  • 2018-10-23: Using matricies in predictive models; introduction to sklearn
  • 2018-10-25: unsupervised learning and dimensionality reduction
  • 2018-10-30: Neural networks I; introduction to keras
  • 2018-11-01: Neural networks II; application to Wikipedia

That should take us to working on your final project for the semester.

A simple model and example

Let's consider an example here where we have two variables, the number of capital letters in a text message an a classification of whether the message is spam (1) or not (0).

In [1]:
caps = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 4, 5, 5, 6, 6, 8, 8]
spam = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]

And let's consider a model that simply classifies a message as being spam based on a threshold for how many capital letters it contains:

$$ f(caps) = \begin{cases} 0, & caps \leq \alpha \\ 1, & caps > \alpha \end{cases} $$

For some threshold $\alpha$. How might we select the best value of $\alpha$? One approach would be to test a bunch of values and determine what the best value is.

So, consider all of the sensible values:

In [2]:
alpha_vals = list(range(0, 9))
alpha_vals
Out[2]:
[0, 1, 2, 3, 4, 5, 6, 7, 8]

And then we can test for each value how many predictions are correct:

In [3]:
for alpha in alpha_vals:
    error = []
    for x, y in zip(caps, spam):
        error.append(int(int(x > alpha) == y))
    print("alpha={0:d}  percent correct={1:f}".format(alpha, sum(error) / len(error)))
alpha=0  percent correct=0.555556
alpha=1  percent correct=0.722222
alpha=2  percent correct=0.666667
alpha=3  percent correct=0.666667
alpha=4  percent correct=0.611111
alpha=5  percent correct=0.611111
alpha=6  percent correct=0.611111
alpha=7  percent correct=0.611111
alpha=8  percent correct=0.500000

So, the best choice appears to by predicting anything with more than 1 capital letter equal to "spam" and otherwise assuming the message is not spam. It appears that we will be correct 72.2% of the time.

Training and Testing

There is one way we are cheating in this simple example. We are using the same dataset to pick the value of $\alpha$ that we are using to validate that it is a good choice. This doesn't matter very much here, but becomes an issue when comparing models of different complexities. It will always seem — if we use the approach here — that a more complex model is better.

A common method to avoid this issue is to split the dataset into two groups:

  1. A training set, which is used to select the parameters of the model (e.g., the $\alpha$ above).
  2. A testing set, which is use to test how well a model performs on new data.

In some cases you may create these datasets yourself; in others, they may be given to you by an external source.

Classification vs Regression

In the spam example, the thing we were trying to predict was a categorical variable. It has just two values, 'spam' and 'not spam', and we want to guess which of these two states a new message belongs to. There are also other prediction tasks that involve predicting a continuous variable. For example, predicting the price of a house. The first type of problem is called classsification and the second are often called regression. The biggest difference between these for us will be how we measure how good a model is at predicting the response. For classification, just computing the percentage of correct guesses is a good start. For regression typically we use absolute error:

$$ | y - \widehat{y} | $$

Or squared error:

$$ | y - \widehat{y} |^2 $$

Different choices lead to different models, something that we discuss in depth in MATH389.

Numpy

One Python library that will be very important useful for us in predictive models is numpy. I've already used this a bit in the python files created for the class, but we haven't seen them directly yet. I'll explain more on Tuesday, but here is a very minimal introduction.

Typically, numpy is imported as the abbrevition np:

In [4]:
import numpy as np

The key function for us here is called np.array. It converts a Python list into a numpy array. At first it will not appear that much has happened:

In [5]:
caps = np.array(caps)
caps
Out[5]:
array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 4, 5, 5, 6, 6, 8, 8])
In [6]:
spam = np.array(spam)
spam
Out[6]:
array([0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1])

But, the big difference is that we can now perform vector arithmetic on the data. That is, we can operate on the object as a whole without needing to write a bunch of for loops:

In [7]:
caps + 1
Out[7]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 5, 6, 6, 7, 7, 9, 9])

And:

In [8]:
spam / (caps + 1)
Out[8]:
array([0.        , 0.        , 1.        , 0.        , 0.        ,
       0.        , 0.        , 0.33333333, 0.        , 0.33333333,
       0.33333333, 0.2       , 0.        , 0.16666667, 0.14285714,
       0.        , 0.11111111, 0.11111111])

There are also a bunch of numeric functions that operate on an entire vector, such as np.mean, np.log, and np.round:

In [9]:
np.mean(spam)
Out[9]:
0.5

And, here is how we would re-write the code to check for the alpha values:

In [10]:
for alpha in alpha_vals:
    error = np.mean((caps > alpha) == spam)
    print("alpha={0:d}  percent correct={1:f}".format(alpha, error))
alpha=0  percent correct=0.555556
alpha=1  percent correct=0.722222
alpha=2  percent correct=0.666667
alpha=3  percent correct=0.666667
alpha=4  percent correct=0.611111
alpha=5  percent correct=0.611111
alpha=6  percent correct=0.611111
alpha=7  percent correct=0.611111
alpha=8  percent correct=0.500000

On the whole, this is quite a bit easier to write and read. This will be even more important, as we will see, when building more complex models.