Tutorial 27: Neural networks, deep learning, and keras

In this tutorial, you will get a very basic introduction to neural networks and how to build them in Python. Let us start by loading all of our standard modules and scripts.

In [1]:
import wiki
import iplot
import wikitext

import numpy as np
import matplotlib.pyplot as plt
import sklearn
Loading BokehJS ...
In [2]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3

For today, we will once again take links from the "important publications in philosophy" page to build a corpus for prediction. We will make a WikiCorpus object to simplify the computation of metrics for these pages. Below I have removed two pages that give our Windows users some trouble.

In [3]:
np.random.seed(0)
links = wikitext.get_internal_links('List_of_important_publications_in_philosophy')['ilinks']
links.remove("What_Is_it_Like_to_Be_a_Bat?")
links.remove("What_is_Life?_(Schrödinger)")
links = np.random.permutation(links)
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)

And, again, we will grab two potential response variables (one continuous and one categorical), along with five numeric predictors that we stack together into a numpy matrix x.

In [4]:
# two potential responses: a continuous one and a categorical one
num_ilinks = wcorp.meta['num_ilinks'].values
lan_version = np.array(['ru' in x for x in wcorp.meta['langs']], dtype=int)

# five numeric predictors computed from the page metadata
num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])

x = np.stack([num_sections, num_images, num_elinks, num_langs, num_chars], axis=1)

Neural networks

Neural networks, or deep learning, are often made to sound like a fancy, scary, impossible-to-understand thing. I try to think of them as just another way of building a predictive model (albeit an important one). I cannot go into too much detail given the time constraint, but let's talk about the basic idea of a small neural network: it's a sequence of linear models chained together.

What's the benefit of putting together multiple linear models? Think of this very simple description of a single input (x), a single output (y), and one "hidden" layer with two "hidden" values (z1 and z2):

[Diagram: a single input x feeds two hidden values z1 and z2, which feed a single output y]

You'd be correct in thinking this is silly. Any output y we could produce this way could be written directly as a linear function of x, without the two hidden values. Visually, we can see that any combination of two linear models just gives another linear model:

[Diagram: adding two linear functions of x yields just another linear function]

However, we make one minor modification. Rather than using the raw output of the linear regressions (z1 and z2), we apply a function called a Rectified Linear Unit, or ReLU. That is a fancy name for taking the positive part of the function: ReLU(z) = max(z, 0). If we do this, then we can get a non-linear output y from a chain of linear models:

[Diagram: applying ReLU to z1 and z2 before combining them produces a non-linear (piecewise linear) output y]

In fact, it turns out that with enough hidden units a neural network is a universal function approximator. That is, it can approximate any (reasonably well-behaved) function to arbitrary accuracy.
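
To make this concrete, here is a tiny numpy sketch of the toy network above (the weights are made up purely for illustration): two linear functions of the input are passed through ReLU and then combined, and the result is no longer a single straight line.

xgrid = np.linspace(-2, 2, 9)              # a grid of input values

z1 = np.maximum(0,  1.0 * xgrid + 0.5)     # first hidden value: linear model, then ReLU
z2 = np.maximum(0, -2.0 * xgrid + 1.0)     # second hidden value: linear model, then ReLU

yy = 1.0 * z1 + 0.5 * z2                   # output: a linear combination of z1 and z2
print(np.round(yy, 2))                     # piecewise linear, not a single straight line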

Building deep learning models

To start actually building a neural network, we need a few functions from keras. Let's load them in here:

In [5]:
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import normalize
Using TensorFlow backend.

Now, we will see how to build predictive models using neural networks and the keras module. As a starting point, we use keras's normalize helper to rescale the data matrix x so that each row (observation) has unit norm.

In [6]:
x = normalize(x)
y = normalize(num_ilinks).transpose()

y_train = y[:325, :]
y_test  = y[325:, :]
x_train = x[:325, :]
x_test  = x[325:, :]
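
As a quick check that normalize did what we expect, the rows of x should now all have length (very close to) one:

np.linalg.norm(x, axis=1)[:5]   # should be approximately 1.0 for every row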

To build the actual model, we start with an empty sequential model:

In [7]:
model = Sequential()

And then add an input layer. The input_dim argument tells keras how many columns are in x, and units says how many hidden z's we want in this first layer. Let's use just 2 hidden values, like our toy example. Notice that I set the 'relu' activation function.

In [8]:
model.add(Dense(units=2, activation='relu', input_dim=5))

Finally, I'll add the output layer. Our response y has only a single column, so this layer has just one unit. There is no activation function here, so the output is simply a linear combination of the two hidden values.

In [9]:
model.add(Dense(units=1))

We can see the entire model by printing out the model summary:

In [10]:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 2)                 12        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 3         
=================================================================
Total params: 15
Trainable params: 15
Non-trainable params: 0
_________________________________________________________________
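
Where do those parameter counts come from? A Dense layer has one weight for every input-output pair plus one bias per unit: the first layer has 5 × 2 + 2 = 12 parameters and the output layer has 2 × 1 + 1 = 3, giving the 15 total parameters reported above.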

Before trying to learn the parameters in the model from our training data, we need to compile the model. This step attaches a loss function and an optimizer to the layers and lets the backend build an efficient computation graph, which matters a lot when training on large datasets.

In [11]:
model.compile(loss='mse', optimizer='sgd')

Finally, we can fit the model using our training data. Keras allows us to pass the validation data directly, so we can see how well the model generalizes as it trains. Note that there is no closed-form solution for the weights; they are learned iteratively by (stochastic) gradient descent, and the epochs parameter controls how many passes the optimizer makes over the training data.

In [12]:
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
Train on 325 samples, validate on 424 samples
Epoch 1/5
325/325 [==============================] - 0s 915us/step - loss: 0.0013 - val_loss: 9.9017e-04
Epoch 2/5
325/325 [==============================] - 0s 59us/step - loss: 0.0010 - val_loss: 8.2571e-04
Epoch 3/5
325/325 [==============================] - 0s 56us/step - loss: 8.9699e-04 - val_loss: 7.1445e-04
Epoch 4/5
325/325 [==============================] - 0s 55us/step - loss: 7.9445e-04 - val_loss: 6.4753e-04
Epoch 5/5
325/325 [==============================] - 0s 57us/step - loss: 7.3108e-04 - val_loss: 6.0376e-04
Out[12]:
<keras.callbacks.History at 0x1a2295b8d0>

Prediction works similarly to the sklearn functions.

In [13]:
pred = model.predict(x_train)
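
To get a single number summarizing the fit on the held-out data, we can compute the mean squared error by hand (model.evaluate(x_test, y_test) reports the same quantity, since we compiled with loss='mse'):

pred_test = model.predict(x_test)
np.mean((pred_test - y_test) ** 2)   # mean squared error on the test set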

And we can see what the learned weights actually are as follows. Here are the weights and bias terms from the first layer:

In [14]:
model.layers[0].get_weights()
Out[14]:
[array([[-0.07426286,  0.9102794 ],
        [-0.7804654 , -0.7553394 ],
        [-0.22777122,  0.2098676 ],
        [-0.3828494 ,  0.04958075],
        [-0.36965996, -0.6735193 ]], dtype=float32),
 array([0., 0.], dtype=float32)]

And here are the weights from the second layer:

In [15]:
model.layers[1].get_weights()
Out[15]:
[array([[ 0.84061944],
        [-0.26529288]], dtype=float32), array([0.01923799], dtype=float32)]
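
These weights are exactly the chained linear models from the diagrams above. As a quick sketch, we can reproduce the model's prediction for the first training page by hand; up to floating-point rounding, this should match model.predict(x_train[:1]):

W1, b1 = model.layers[0].get_weights()     # first layer: 5 inputs -> 2 hidden values
W2, b2 = model.layers[1].get_weights()     # second layer: 2 hidden values -> 1 output

z = np.maximum(x_train[:1] @ W1 + b1, 0)   # linear model followed by ReLU
y_hat = z @ W2 + b2                        # linear combination of the hidden values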

A deeper model

We can construct much larger and deeper models using keras. Here is a model with four hidden layers of 32 units each.

In [16]:
model = Sequential()
model.add(Dense(units=32, activation='relu', input_dim=5))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=1))
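
This network is much bigger than the first one: the input layer has 5 × 32 + 32 = 192 parameters, each of the three hidden-to-hidden layers has 32 × 32 + 32 = 1,056, and the output layer has 32 + 1 = 33, for 3,393 trainable parameters in total (model.summary() will list these counts layer by layer).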
In [17]:
model.compile(loss='mse', optimizer='sgd')
In [18]:
model.fit(x_train, y_train, epochs=25, validation_data=(x_test, y_test))
Train on 325 samples, validate on 424 samples
Epoch 1/25
325/325 [==============================] - 1s 2ms/step - loss: 0.0029 - val_loss: 0.0012
Epoch 2/25
325/325 [==============================] - 0s 100us/step - loss: 0.0010 - val_loss: 6.4767e-04
Epoch 3/25
325/325 [==============================] - 0s 98us/step - loss: 6.9413e-04 - val_loss: 5.6534e-04
Epoch 4/25
325/325 [==============================] - 0s 95us/step - loss: 6.3973e-04 - val_loss: 5.5147e-04
Epoch 5/25
325/325 [==============================] - 0s 95us/step - loss: 6.2472e-04 - val_loss: 5.4981e-04
Epoch 6/25
325/325 [==============================] - 0s 98us/step - loss: 6.1973e-04 - val_loss: 5.5002e-04
Epoch 7/25
325/325 [==============================] - 0s 97us/step - loss: 6.2029e-04 - val_loss: 5.4983e-04
Epoch 8/25
325/325 [==============================] - 0s 96us/step - loss: 6.1939e-04 - val_loss: 5.4952e-04
Epoch 9/25
325/325 [==============================] - 0s 95us/step - loss: 6.2067e-04 - val_loss: 5.5028e-04
Epoch 10/25
325/325 [==============================] - 0s 95us/step - loss: 6.2002e-04 - val_loss: 5.4969e-04
Epoch 11/25
325/325 [==============================] - 0s 96us/step - loss: 6.2014e-04 - val_loss: 5.4925e-04
Epoch 12/25
325/325 [==============================] - 0s 97us/step - loss: 6.2148e-04 - val_loss: 5.4959e-04
Epoch 13/25
325/325 [==============================] - 0s 102us/step - loss: 6.2097e-04 - val_loss: 5.4936e-04
Epoch 14/25
325/325 [==============================] - 0s 99us/step - loss: 6.1967e-04 - val_loss: 5.5498e-04
Epoch 15/25
325/325 [==============================] - 0s 98us/step - loss: 6.2005e-04 - val_loss: 5.5409e-04
Epoch 16/25
325/325 [==============================] - 0s 101us/step - loss: 6.1948e-04 - val_loss: 5.5904e-04
Epoch 17/25
325/325 [==============================] - 0s 101us/step - loss: 6.2070e-04 - val_loss: 5.5313e-04
Epoch 18/25
325/325 [==============================] - 0s 108us/step - loss: 6.1939e-04 - val_loss: 5.5481e-04
Epoch 19/25
325/325 [==============================] - 0s 105us/step - loss: 6.1987e-04 - val_loss: 5.5254e-04
Epoch 20/25
325/325 [==============================] - 0s 96us/step - loss: 6.1942e-04 - val_loss: 5.5117e-04
Epoch 21/25
325/325 [==============================] - 0s 98us/step - loss: 6.1933e-04 - val_loss: 5.5261e-04
Epoch 22/25
325/325 [==============================] - 0s 95us/step - loss: 6.1982e-04 - val_loss: 5.5394e-04
Epoch 23/25
325/325 [==============================] - 0s 105us/step - loss: 6.1985e-04 - val_loss: 5.5026e-04
Epoch 24/25
325/325 [==============================] - 0s 124us/step - loss: 6.1928e-04 - val_loss: 5.5583e-04
Epoch 25/25
325/325 [==============================] - 0s 116us/step - loss: 6.2046e-04 - val_loss: 5.5112e-04
Out[18]:
<keras.callbacks.History at 0x1a22955898>

Neural networks for classification

We can, and more often than not do, build neural networks for classification tasks. The easiest way to do this is to convert the categorical output into a one-hot encoding: a matrix with one column per category, where each row has a 1 in the column for its category and 0s elsewhere. This can be done with the to_categorical function from keras.

In [19]:
from keras.utils import to_categorical
In [20]:
y = to_categorical(lan_version)
y[:10,]
Out[20]:
array([[0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]], dtype=float32)

We can then split this into a training and testing set.

In [21]:
y_train = y[:325, :]
y_test  = y[325:, :]

Now, if we build a neural network for this response we need to make two changes: first, the final layer needs to have two units (one per category), and second, the final layer needs a special activation function. The special activation function is called a "softmax", and it ensures that the two output values are positive and add up to one, so we can read them as class probabilities.
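
To see what softmax does, here it is computed by hand on a made-up pair of scores (just an illustration, not part of the model):

scores = np.array([2.0, 0.5])
np.exp(scores) / np.sum(np.exp(scores))   # two positive values that sum to one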

In [22]:
model = Sequential()
model.add(Dense(units=32, activation='relu', input_dim=5))
model.add(Dense(units=2, activation='softmax'))

We also use some different parameters when compiling the model: the categorical cross-entropy loss compares the predicted class probabilities to the one-hot labels, the RMSprop optimizer typically converges faster here than plain SGD, and adding the accuracy metric reports the proportion of correctly classified pages during training:

In [23]:
model.compile(loss='categorical_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])

Fitting the model works exactly the same way.

In [24]:
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
Train on 325 samples, validate on 424 samples
Epoch 1/5
325/325 [==============================] - 1s 2ms/step - loss: 0.6685 - acc: 0.6615 - val_loss: 0.6537 - val_acc: 0.6769
Epoch 2/5
325/325 [==============================] - 0s 83us/step - loss: 0.6534 - acc: 0.6615 - val_loss: 0.6394 - val_acc: 0.6769
Epoch 3/5
325/325 [==============================] - 0s 84us/step - loss: 0.6455 - acc: 0.6615 - val_loss: 0.6342 - val_acc: 0.6769
Epoch 4/5
325/325 [==============================] - 0s 91us/step - loss: 0.6425 - acc: 0.6615 - val_loss: 0.6329 - val_acc: 0.6769
Epoch 5/5
325/325 [==============================] - 0s 82us/step - loss: 0.6417 - acc: 0.6615 - val_loss: 0.6302 - val_acc: 0.6769
Out[24]:
<keras.callbacks.History at 0x1a232d8be0>
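
Predictions now come back as two columns of class probabilities. Taking the larger of the two gives a predicted class, and comparing it to the true labels should reproduce the validation accuracy reported above:

pred_prob = model.predict(x_test)                   # one row of probabilities per page
pred_class = np.argmax(pred_prob, axis=1)           # 0 or 1: the more probable class
np.mean(pred_class == np.argmax(y_test, axis=1))    # proportion classified correctly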

Conclusions

I'll admit that neither of these toy problems works very well with neural networks; they are not the kinds of problems neural networks are designed for. I will also admit that we really do not have the time (nor can I assume the mathematical background) needed to learn in depth how to build neural networks in MATH289. I hope, though, that you get something out of these notes. We will be using neural networks in the next tutorial, and I think you'll find them, in that form, quite accessible.