# Class 20: Faster, Higher, Stronger (and Deeper!)

### 2017-11-07

## Local Weights in Neural Networks

We know that it is possible to unravel the pixel values describing an image to turn a set of images into a high-dimensional matrix. The columns of this matrix can be put into an elastic net or a neural network and trained much as we would with any other numeric dataset. There is, however, quite a lot of information contained in the structure of the image data that we lose with this approach. Ideally, we would use the fact that some pixels are close to one another and that certain color channels describe the same or neighboring pixels.

The solution is to use convolutional neural networks (CNNs). Despite their name, CNNs are not actually a different kind of neural network but instead refer to a particular kind of layer in a neural network. In a purely pragmatic sense of writing the code, adding convolutional layers into a keras model is very easy. Understanding what they are actually doing, however, can sometimes be difficult. Today we will spend some time trying to build up to convolutional layers before showing how they work on our data.

## Convolutions for edge detection

We will start by considering convolutions created manually outside of a neural network. To start, we need a kernel matrix. Let’s use a kernel with one row and 2 columns:

The convolution implied by this kernel takes each pixel in an image and subtracts the value of the pixel to its immediate right. If the input image has a resolution of 28-by-28, what size and shape will the result of this convolution be? Without modification, it needs to be 28-by-27. The lost column arises because we have no way of applying the convolution to pixels in the right-most column of the image. This is usually fixed (keeping the image size constant is useful) by adding a virtual column of 0’s. With this, the result of the convolution is another image of size 28-by-28.
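To make the size arithmetic concrete, here is a minimal pure-Python sketch of this convolution. The kernel values `[1, -1]` are inferred from the description above (each pixel minus its right-hand neighbor), and the helper name `diff_right` is hypothetical rather than anything from keras.

```python
def diff_right(image, pad=True):
    """Subtract from each pixel the value of the pixel to its immediate right.

    This is the same as sliding the 1-by-2 kernel [1, -1] along every row.
    With pad=True a virtual column of zeros is appended on the right,
    so the output has the same shape as the input.
    """
    out = []
    for row in image:
        padded = row + [0] if pad else row
        out.append([padded[j] - padded[j + 1] for j in range(len(padded) - 1)])
    return out


# A blank 28-by-28 image is enough to check the output shapes.
image = [[0] * 28 for _ in range(28)]

unpadded = diff_right(image, pad=False)
padded = diff_right(image, pad=True)

print(len(unpadded), len(unpadded[0]))  # 28 27
print(len(padded), len(padded[0]))      # 28 28
```

Without padding each row loses one value, because the last pixel has no right-hand neighbor; the zero column restores the 28-by-28 shape.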

Let’s apply this to a small example. Here we have an input “image” of only 8-by-8. The image seems to show something like a capital letter “L”.

By constructing the variable `x2` as a version of `x` padded with zeros, we can apply the kernel matrix as follows:

Typically we will apply a ReLU activation to the output of a convolution, so we care only about the positive values. Notice here that almost all of these occur where there is a vertical edge to the capital “L”. There is also one bit of an activation at the end of the bottom part of the letter.

Now, consider a different kernel given by a matrix with one column and two rows:

To apply this, we take each pixel value and subtract the value immediately below it in the image. By padding the input with an extra row of 0’s, we can get an output image that is the same size as the input. Let’s apply this to the matrix `x` as well:

The positive values from this kernel all occur on the bottom edge of the “L”. So the first kernel detects vertical edges and the second one detects horizontal edges. Presumably, knowing where in the image these types of lines are would be very helpful in identifying digits, fashion items, letters, and other types of objects.
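The behavior of the two kernels can be checked with a short sketch. The 8-by-8 image below is a hypothetical stand-in for `x` (the exact pixel values used in class may differ), and the helper names are illustrative only.

```python
# Hypothetical 8-by-8 image showing a capital "L".
x = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]


def diff_right(image):
    """Each pixel minus its right-hand neighbor (zero beyond the edge)."""
    return [
        [row[j] - (row[j + 1] if j + 1 < len(row) else 0) for j in range(len(row))]
        for row in image
    ]


def diff_below(image):
    """Each pixel minus the pixel directly below it (zero beyond the edge)."""
    n = len(image)
    return [
        [image[i][j] - (image[i + 1][j] if i + 1 < n else 0)
         for j in range(len(image[i]))]
        for i in range(n)
    ]


def relu(image):
    """Keep positive values and set everything else to zero."""
    return [[max(v, 0) for v in row] for row in image]


def positive_positions(image):
    """List the (row, column) positions holding a positive value."""
    return [(i, j) for i, row in enumerate(image) for j, v in enumerate(row) if v > 0]


vertical = positive_positions(relu(diff_right(x)))
horizontal = positive_positions(relu(diff_below(x)))

print(vertical)    # [(1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 5)]
print(horizontal)  # [(6, 2), (6, 3), (6, 4), (6, 5)]
```

For this stand-in image, the first kernel fires along the stem of the letter plus once at the tip of the foot, and the second fires only along the bottom edge.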

If we have a training set here of 1000 images, our input data will have a dimension of 1000-by-8-by-8-by-1, because these are 8-by-8 black and white images. What is the dimension after applying these two kernels? Well, for each kernel we have an image of the same size as the original, so we have 1000-by-8-by-8-by-2.

Each of the outputs from the kernels is called a filter. Here we have two filters. You can think of these similarly to the red, green, and blue components of a color image. Each filter tells something useful about a particular part of the original image. Here, it tells whether there is a vertical edge (first component) or horizontal edge (second component).

We could apply another convolution to the output of the first set of convolutions. The kernel here would need to be three dimensional, with a depth of 2, because it has to say what to do with each of the two filters. Likewise, a kernel for an input color image needs three dimensions as well.
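As a rough illustration of a kernel with depth, the sketch below applies a kernel of shape 1-by-2-by-2 to a tiny input with two filters. The input values, kernel weights, and the function name `conv_depth` are all made up for the example. Like deep learning libraries, it slides the kernel without flipping it.

```python
def conv_depth(stack, kernel):
    """Apply a (kh, kw, depth) kernel to an (h, w, depth) input.

    Positions that fall off the bottom or right edge are treated as zeros,
    so the output stays h-by-w. Each output value sums over the kernel
    window and over every input filter.
    """
    h, w, depth = len(stack), len(stack[0]), len(stack[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    if i + di < h and j + dj < w:
                        for d in range(depth):
                            total += kernel[di][dj][d] * stack[i + di][j + dj][d]
            out[i][j] = total
    return out


# A 2-by-2 input with two filters per pixel (hypothetical values).
stack = [
    [[1, 0], [0, 2]],
    [[3, 1], [1, 1]],
]

# One row, two columns, depth two: one weight per filter at each position.
kernel = [[[1, 1], [-1, 0]]]

result = conv_depth(stack, kernel)
print(result)  # [[1, 2], [3, 2]]
```

The output has a single value per pixel: the depth dimension of the input is summed away, which is why each new filter needs its own three-dimensional kernel.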

Once we have applied convolutions, you can imagine that for most applications we do not care exactly where edges or any other features are found. For digit detection we instead just care *generally* where edges of a particular type are found. With a large number of filters, the data size after a convolution can also quickly become quite large. A solution to this is known as *max pooling*. We reduce the width and height of each input by a factor of two by dividing the image into 2-by-2 blocks and taking the maximum value of each filter within a block. Here, we apply max pooling, as well as ReLU activations, to the values in the vertical filter:

In theory, we can pool using other sizes, such as a grid of 3-by-3 points. However, the 2-by-2 is the most common and rarely do we need anything else.
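A minimal sketch of ReLU followed by 2-by-2 max pooling is given here. The 4-by-4 input is a made-up filter output, and the function assumes an even height and width.

```python
def relu(image):
    """Keep positive values and set everything else to zero."""
    return [[max(v, 0) for v in row] for row in image]


def max_pool_2x2(image):
    """Split the image into 2-by-2 blocks and keep the maximum of each block.

    Assumes the height and width are both even.
    """
    h, w = len(image), len(image[0])
    return [
        [
            max(image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1])
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


# Hypothetical output of one filter on a 4-by-4 image.
filter_output = [
    [0, -1, 1, 0],
    [0, -1, 1, 0],
    [2, 0, 0, -3],
    [0, 0, -1, 1],
]

pooled = max_pool_2x2(relu(filter_output))
print(pooled)  # [[0, 1], [2, 1]]
```

Each pooled value records whether the feature appeared anywhere in its block, which keeps the general location while halving both dimensions.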

If we apply max pooling, the dataset now has a size of 1000-by-4-by-4-by-2. Consider another convolution with 5 filters. The resulting size becomes 1000-by-4-by-4-by-5. With max pooling again, we get 1000-by-2-by-2-by-5.

Once we have used enough combinations of pooling and convolution, the array can *then* be unravelled to form a dataset of size 1000-by-20.
This data is small relative to the input and has already learned localized features. A dense neural network can then use it to produce predicted probabilities.
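The sequence of sizes described above can be traced with a few lines of arithmetic. This sketch assumes zero padding (so a convolution leaves width and height unchanged) and a trailing dimension for the number of filters; the function names are hypothetical.

```python
def conv_same(shape, filters):
    """A padded convolution keeps height and width and sets the filter count."""
    n, h, w, _ = shape
    return (n, h, w, filters)


def max_pool(shape, size=2):
    """Max pooling divides height and width by the pool size."""
    n, h, w, c = shape
    return (n, h // size, w // size, c)


def flatten(shape):
    """Unravel everything except the observation dimension."""
    n, h, w, c = shape
    return (n, h * w * c)


shapes = [(1000, 8, 8, 1)]                # 1000 black and white 8-by-8 images
shapes.append(conv_same(shapes[-1], 2))   # two edge-detection kernels
shapes.append(max_pool(shapes[-1]))       # 2-by-2 max pooling
shapes.append(conv_same(shapes[-1], 5))   # convolution with 5 filters
shapes.append(max_pool(shapes[-1]))       # 2-by-2 max pooling again
shapes.append(flatten(shapes[-1]))        # unravel for the dense layers

for s in shapes:
    print(s)
# (1000, 8, 8, 1)
# (1000, 8, 8, 2)
# (1000, 4, 4, 2)
# (1000, 4, 4, 5)
# (1000, 2, 2, 5)
# (1000, 20)
```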

## Convolutions in a neural network

Now, we will apply convolutions and max pooling in the context of an actual neural network. This is exactly the same as described in our small example above, except that the values of the weights in the kernels are learned from the data rather than being pre-determined. This means that they have the power to detect patterns we would not have thought of, but it also comes at the cost of no longer being able to describe exactly what each filter is doing.

We start by working with the MNIST dataset again. We need the data to remain in its array-format, so we will not collapse it into a matrix this time.

We now build a simple convolutional neural network, with just one convolutional layer with 16 filters, followed by max pooling. Typically, unlike our example above, we use square kernels. Most often these are of size 2-by-2, though 3-by-3 and even 5-by-5 are seen in certain architectures. Note that we need to fully describe the correct dimension of `X_train`. We also need a layer called `layer_flatten` when going from the convolutional part of the network to the dense part.

Setting `padding` to “same” is what makes it so that the first layer outputs images of size 28-by-28.

Notice that this model does not have many weights compared to the neural networks we used last time. It takes a fairly long time to run, though, given its size. The reason is that convolutions generally do not have a large number of weights: for each filter we are only learning 2-by-2-by-(num prior filters) values, unlike the dense layers that connect everything in one layer to everything in the prior layer. Computing the gradient and the inputs to the next layer, however, takes a long time because we have to apply each kernel to the entire image.
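The trade-off between few weights and a lot of computation can be seen with some quick arithmetic. The sketch below assumes the layer described above (16 filters, 2-by-2 kernels, a 28-by-28 single-channel input with "same" padding) and one bias term per filter or unit. The dense layer with 16 units is a hypothetical point of comparison, not a layer from the model.

```python
def conv_weights(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def conv_multiplies(height, width, kernel_h, kernel_w, in_channels, filters):
    """Each kernel is applied at every pixel position of the padded image."""
    return height * width * kernel_h * kernel_w * in_channels * filters


def dense_weights(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


def dense_multiplies(n_in, n_out):
    """Each weight is used exactly once per observation."""
    return n_in * n_out


conv_w = conv_weights(2, 2, 1, 16)
conv_m = conv_multiplies(28, 28, 2, 2, 1, 16)
dense_w = dense_weights(28 * 28, 16)
dense_m = dense_multiplies(28 * 28, 16)

print(conv_w, conv_m)    # 80 50176
print(dense_w, dense_m)  # 12560 12544
```

Under these assumptions the convolutional layer has far fewer weights than the dense layer, yet it needs about four times as many multiplications per image, because every weight is reused at every pixel.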

### Visualize the kernels

Much as we did with the weights in the dense neural networks, we can visualize the kernels learned by the neural network. To start, make sure that the sizes of the weights in the convolutional layer make sense:

The 16 in each refers to the specific kernel, with all 16 put into a single 4-dimensional array.

There is not much we can directly do with these, but it is good to see them in order to check whether you understand what these convolutions are doing to the input image.

### Le-Net for MNIST-10 (1995)

Yann LeCun, one of the creators of the MNIST-10 dataset, was a pioneer of using CNNs. His 1995 paper provided one of the first examples where CNNs produced state-of-the-art results for image classification:

- LeCun, Yann, et al. “Learning algorithms for classification: A comparison on handwritten digit recognition.” *Neural networks: the statistical mechanics perspective* 261 (1995): 276.

I have tried to reproduce his exact model and training technique here in keras. As techniques were not standardized at the time, this is only approximate, but it gets at the general point of how powerful these CNNs are.

To fit the model, we do SGD with a large learning rate that is manually decreased after a handful of epochs. Remember, in 1995 computers were not nearly as powerful as they are today, so it was only feasible to run a limited number of epochs, even on a large research project with the resources of AT&T Bell Labs.
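One way to picture a learning rate that is lowered every few epochs is a simple step schedule. The sketch below is only an illustration: the starting rate, the halving factor, and the three-epoch interval are made-up values, not the ones used in the paper or in class.

```python
def step_learning_rate(epoch, initial=0.1, drop=0.5, every=3):
    """Multiply the initial rate by `drop` once for every `every` epochs completed."""
    return initial * drop ** (epoch // every)


# Epochs are counted from zero.
rates = [step_learning_rate(epoch) for epoch in range(9)]
for epoch, rate in enumerate(rates):
    print(epoch, rate)
```

With these hypothetical settings the rate stays at 0.1 for epochs 0 to 2, drops to 0.05 for epochs 3 to 5, and to 0.025 for epochs 6 to 8.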

The prediction here is fairly good, and better than the dense networks I had last time.

This is not quite as good as the 99% accuracy reported in the paper. Likely this is due to using both the training and validation sets, as well as some tweaks Yann LeCun and company used that were not fully documented in their paper. This is a common but frustrating pattern that occurs often in the neural network literature.

## EMNIST

The EMNIST dataset is a relatively new dataset constructed by NIST. It matches the data from the handwritten digits in size, but uses letters rather than numbers. For size reasons, here we will look at just the first ten letters of the alphabet.

Let’s read the data in now:

Notice that the dataset includes both upper and lower case letters, making the task in theory more difficult than the MNIST data.

Here is a much larger CNN model for use with the EMNIST dataset. It follows some typical design patterns, namely:

- uses 2-by-2 kernels with a power of 2 number of filters
- uses ReLU activations throughout
- uses “double convolutions”; that is, one convolution followed by another without a pooling layer in between
- uses a 2-by-2 max pooling layer and dropout after the double convolution
- top layers use power of 2 number of nodes, with a uniform number in each layer
- drop out at the last layer, along with a softmax activation
- padding uses the “same” logic, to make max pooling even (i.e., it keeps dimensions divisible by a power of 2)

And here is the actual model:

Given the added difficulty of this task, our result of 96% is not a bad start.

For the next lab you’ll be working with the EMNIST dataset, but using a different subset of the letters.

### Negative examples, again

Which letters are the hardest to distinguish?

Looks like the hardest to classify are “a” and “g”, followed by “g” and “b”, as well as “h” and “a” (likely all lower case versions).
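A ranking like this can be produced by counting the misclassified pairs. The sketch below uses a handful of made-up labels rather than the real EMNIST predictions, and treats a pair as unordered, so that mistaking “a” for “g” and “g” for “a” are counted together.

```python
from collections import Counter


def most_confused(y_true, y_pred, top=3):
    """Count misclassified (true, predicted) pairs, ignoring their order."""
    pairs = Counter(
        tuple(sorted((t, p))) for t, p in zip(y_true, y_pred) if t != p
    )
    return pairs.most_common(top)


# Hypothetical true and predicted labels for ten validation images.
y_true = ["a", "g", "g", "a", "b", "g", "h", "c", "a", "d"]
y_pred = ["g", "a", "a", "g", "g", "b", "a", "c", "a", "d"]

confused = most_confused(y_true, y_pred)
print(confused)  # [(('a', 'g'), 4), (('b', 'g'), 2), (('a', 'h'), 1)]
```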

The blue letter gives the initial value and red the predicted value. It is generally a good sign that many of these are hard for us to make out as well. Some of them, in fact, I would have guessed the same as the neural network even though in theory they were not the “correct” labels.