# Class 11: Thinking, Fast and Slow

### 2017-10-03

**Have you studied neural networks before or used them in
an internship, job, or research project? If so, please forget
everything you think you know about them before proceeding.**

## Dense neural networks

I want to think about doing something that may seem somewhat silly
at first. Let’s consider what would happen if we describe *two* linear
models to the same dataset. Don’t worry yet why we would want to do
this or how we would do this, just think about what it might look like.
To simplify things, let’s assume that we have only an intercept and
one prediction variable in the model. This will yield two sets of predicted
values, which we will denote by $z_1$ and $z_2$:

And

Notice that x, z_1, and z_2 are all vectors of length n. Now, let’s do something with these two predicted values. We’ll construct another liner model with these outputs, and name this the variable w:

What is the relationship between w and x? In fact, this is nothing but a very complex way of describing a linear relationship with w and x using 7 parameters instead of two. We can see this by simplifying:

The mathematical description is nice, but I find it helpful to also see a picture. Here, I’ll set the $\alpha_3$ term to 0 and the gamma terms both to one. This means that the last model simply adds the two lines from the first regression together. We can graphically see as well that this leads to the final model being linear too:

If we consider w to be the predicted values of this combined model, we could find some values for the seven parameters (3 alpha, 2 betas, and 2 gammas) that minimize the RMSE of the response. Of course, there will not be a unique value for this minimum because ultimately the final model is just a simple linear regression.

At the same time, this is an example of a neural network. It has
two *hidden nodes*, denoted by the z’s, which together form a hidden
layer. The input layer has just one node, a single x, and the output
has just one node, a single w. We can think of this as a network as
follows:

Arrows indicate which variables are being sent to the inputs of the next layer. There is one parameter for each arrow, plus one more for each node other than the input layer for all of the intercepts.

### Activation functions

The trick to why making combinations of linear models is a good idea is both simple and quite subtle. Let’s change the definition of just the variable w to include the application of a function sigma:

Sigma is not an learned function. It is just a fixed mapping that takes
any real number and returns another real number. In neural network
parlance it is call an activation function. Today we will use the
simpliest such activation, known as a **rectified linear unit (ReLU)**.
In what is the perhaps the biggest disconnect between the complexity
of a term and the complexity of what the term represents, this is the
defention of of a ReLU activation:

Positive values are treated as an identity function and negative values are pushed to zero. That is all.

It turns out that the addition of this activation function greatly increases the complexity of the relationship between x and w. Notice how we can now create non-linear relationships by combining two linear functions where we have applied the ReLU unit before adding them together:

The output now looks similar to a quadratic term. If we put more hidden units into the model, and more input values into the first layer, we can create very interesting non-linear interactions in the output. In fact, with enough hidden neurons, a one-hidden layer neural network can approximate any smooth function between x and w within an arbitrarily small precision.

## Keras

### Installation in R

To build neural networks in R, we are going to use the keras library. Originally developed for Python, there are now several wrappers for R. Instructions to install it an included at the end of today’s class notes. Here, I have already installed it and load the package in as you would any other package.

### Housing data again

For the first bit of today, let’s use the (easy) California housing dataset. We will move onto something different in a moment, but using this model will be helpful.

### Replication the model above

Like `glmnet`

and `xgboost`

, keras requires our data to be in
the form of a matrix. Let’s grab such a matrix using just the
latitude variable. I will also scale the data matrix at the
same time

To build a model with **keras**, we first need to create a base
empty model using the function `keras_model_sequential`

. We then
add layers using the `%>%`

function. To replicate our very simple
model above, we use `layer_dense`

, which is just a plain vanilla
set of neurons with no fancy logic inside of them. We’ll also use
the relu activation function, applied here as a seperate layer with
no learned parameters.

Notice that printing the model gives a nice summary of all of the layers, the number of parameters in each layer, and the output shape that is fed into the next layer.

Once the model has been built, we need to compile it before doing anything else with it. Here we set the loss, the thing we want to minimize, to the mean squared error, and set a particular algorithm for finding all of the best parameters. More on this in a bit.

Once the model has been compiled we can fit it to our dataset.

Notice that there is a lot of output that comes along with fitting a neural network. As with gradient boosted trees, we get a running tally of the training and validation performance. We’ll talk more about what Epoch is and how this model is actually fit next class.

### Predicted values

I’ll use a prediction function to predict values from this model. Notice that the learned relationship is more complex than a straightforward linear regression:

### Larger models

We can easily add more hidden layers with more variables. Here we have two hidden layers, each with 8 variables

Notice that only the first layer needs the `input_shape`

parameter.
We’ll compile and fit this again:

The relationship is now even more non-linear. We have two peaks, one for LA and the other for San Fransisco, and a steep decline in northern California.

At this point, neural networks should seem like a natural extension of the models we have already seen rather than a fancy black-box. It is, in my opinion, too often explained as the latter.

## Classification with neural networks

We have just seen how to use neural networks to do regression with one
continuous response. Of course, we could modify this to do binary
classification by coding factors as 0 and 1. Further, the *one-vs-one*
and *one-vs-many* paradigms would allow this to extend to multiclass
classification. There is, however, a more straightforward approach.
If you recall that we used the **nnet** package to fit the multinomial
function, this should not be surprising.

### One-hot encoding

The trick for doing multiclass regression with neural networks is to realize that we can easily make the last layer of a neural network output multiple values. All we need to do is have the function that we want to minimize by a function of all these outputs rather than a single output with each row of the data. An easy way to do this is to assign multiple response variables with each observation and do mean squared error loss, but we will now take the mean squared loss of predicting all the outputs.

The **keras** package include a function `to_categorical`

that converts
numeric class labels to binary indicator variables. This is very similar
to the model matricies we built for the X matrix when using categorical
predictors. Here is a simple example of its application:

Notice that keras is 0-indexed, so it wants the first category to be zero. That’s why we have 11 columns even thought the largest category is ten. The best way to understand how to use this response matrix is to see a worked example in keras.

### Chicago Crimes, again

I am going to load another version of the Chicago crimes data, but this time there are 12 crime types:

I want to just use latitude, longitude, and hour here to simplify the analysis. The X data matrix is unchanged from previous code examples other than the fact that we scale the columns.

Typically, I create a y variable as before, but make sure to subtract one so that
the smallest category is zero. When creating `y_train`

and `y_valid`

, I use the
`to_categorical`

function, manually specifying the number of classes to avoid any
issues:

Take note of what the response matrix looks like:

### Building a classification model

Here is the specificiation of a model that could train on our dataset. The input shape is three because we have three input variables. The last layer has 12 units because the respones has 12 columns. I use a new activation layer at the end called a “softmax”. The forces the output values to all be positive and sum to 1; therefore we can safely view them as proper probabilities.

Notice that this fairly simple model already has 752 parameters that we need to learn. Neural networks can get complex very quickly!

Now, we compile the model. While “mse” could work for the function we want to minimize, something called “categorical_crossentropy” is better for some technical reasons when learning to do multiclass classification. We’ll also print out the accuracy at each step.

Random guessing yields a classification rate of 8%, so our model does approximately twice as well as guessing. Not bad for a first take at using neural networks.

In order to get predictions on the whole dataset, we can use the function `predict_classes`

.
A `predict`

function exists, but it would give us a matrix of values much like `y_train`

.
The `predict_classes`

function helpfully saves a step. **Don’t forget to add one to predictions!**

Finally, take a look at the lower right-hand panel in RStudio. You will see a helpful plot of the training and validation loss functions metrics.

## Installation

To install keras, we need to install the package, the keras library itself, and python. The best first step is to install keras as you would any package:

Once that is done, load the library and try to run the `install_keras`

function:

Depending on your system, it may just run and download some files and
you are all set to go. In the future, just call `library(keras)`

as
you would any other package. If it complains about not having Python,
install Anaconda for Python 3 from this link:

Then, restart RStudio and run the block of code again. If there is a different error message (most likely you are one a Mac), try to do whatever it says.

If you run into trouble with this, please let me know now. We are doing this one unit with neural networks today, but will return to them in force for the entire month of November.