We know that it is possible to unravel the pixel counts
describing an image to turn images into high-dimensional
matrix. The columns from this can be put into an elastic net or a
neural network to train much as we would with any other
numeric dataset. There is, however, quite a lot of information
contained in the structure of the image data that we are losing
by this approach. Ideally, we would use the information about the
fact that some pixels as close to one another and that certain
color channels describe the same or neighboring pixels.
The solution is to use convolutional neural networks (CNNs).
Despite their name, CNNs are not actually a different kind of
neural network but instead refer to a particular kind of layer in
a neural network. In a purely pragmatic sense of writing the
code, adding convolutional layers into a keras model is very
easy. Understanding what they are actually doing, however,
can sometimes be difficult. Today we will spend some time
trying to build up to convolutional layers before showing
how they work on our data.
Convolutions for edge detection
We will start by considering convolutions created manually
outside of a neural network. To start, we need a kernel
matrix. Let’s use a kernel with one row and 2 columns:
The convolution implied by this kernel takes the pixels in
an image and subtracts the value of the pixel value to its
immediate right. If the input image has a resolution of
28-by-28, what size and shape with the result of this
convolution be? Without modification, it needs to be 28-by-27.
The lost column comes because we do not have a way of applying
the convolution to pixels in the right-most column of the image.
This is usually fixed (keeping the image size constant is
useful) by adding a virtual column of 0’s. With this, the
result of the convolution is another image of size 28-by-28.
Let’s apply this to a small example. Here we have an input “image”
of only 8-by-8. The image seems to show something like a capital
By constructing the variable x2 as version of x padded with zeros,
we can apply the kernel matrix as follows:
Typically we will apply a ReLU activation to the output of
a convolution, so we care only about the positive values.
Notice here that almost all of these occur where there is
a vertical edge to the capital “L”. There is also one bit of
an activation at the end of the bottom part of the letter.
Now, consider a different kernel given by a matrix with
one column and two rows:
To apply this, we take each pixel value and subtract the value
immediately below it in the image. By padding the input with an
extra row of 0’s, we can get an output image that is the same
size as the input. Let’s apply this to the matrix x as well:
The positive values from this kernel all occur on the bottom edge of
the “L”. So the first kernel detects vertical edges and the second
one detects horizontal images. Presumably, knowing where in the image
these types of lines are would be very helpful in identifying digits,
fashion items, letters, and other types of objects.
If we have a training set here of 1000 images, our input data
will have a dimension of
Because these are 8-by-8 black and white images. What is the dimension
after applying these two kernels? Well, for each kernel we have an
image of the same size as the original, so we have
Each of the outputs from the kernels is called a filter. Here we have
two filters. You can think of these similarly to the red, green, and blue
components of a color image. Each filter tells something useful about a
particular part of the original image. Here, it tells whether there is
a vertical edge (first component) or horizontal edge (second component).
We could apply another convolution to the output of the first set of
convolutions. The kernel here would need to be three dimensional, with
a depth of 2, because it has to say what to do with each of the two
filters. Likewise, a kernel for an input color image needs three dimensions
Once we have applied convolutions, you an imagine for most applications
we do not care exactly where edges or any other features are found. For
digit detection we instead just care generally where edges of a
particular type are found. With a large number of filters, the data
size after a convolution can also quickly become quite large. A solution
to this is known as max pooling. We reduce the width and height of
each input by a factor of two by dividing the image into 2-by-2 blocks
and taking the maximum value of each filter within a block. Here, we
apply max pooling, as well as ReLU activations, to the values in the
In theory, we can pool using other sizes, such as a grid of 3-by-3
points. However, the 2-by-2 is the most common and rarely do we need
If we apply max pooling, the dataset now has a size of:
Consider another convolution with 5 filters. The resulting size
With max pooling again, we get:
Once we have used enough combinations of pooling and convolution, the
array can then be unravelled to form a dataset of size
This data is small relative to the input and has already learned
localized features. A dense neural network can then use it to produce
Now, we will apply convolutions and max pooling the context of an
actual neural network. This is exactly the same as described in our
small example above, however the values of the weights in the kernels
are learned from the data rather than being pre-determined. This means
that they have the power to detect patterns we would not have thought
of, but it also comes at the cost of not longer being able to describe
exactly what each filter is doing.
We start by working with the MNIST dataset again. We need
the data to remain in its array-format, so we will not collapse
it into a matrix this time.
We now build a simple convolution neural network, with just
one convolutional layer with 16 filters, followed by max pooling.
Typically, unlike our example above, we use square kernels. Most
often these are of size 2-by-2, though 3-by-3 and even 5-by-5 are
seen in certain architectures. Note that we need to fully
describe the correct dimension of X_train. We also need a
layer called layer_flatten when going from the convolutional
part of the network to the dense part.
Setting padding to “same” is what maxes it so that the first layer
outputs images of size 28-by-28.
Notice that this model does not have many weight compared to the
neural networks we used last time. It takes a fairly long time to
run, though, given its size. The reason is that convolutions generally
do not have a large weights because for each filter we are only learning
2-by-2-by-(num prior filters) values, unlike the dense layers that
connect everything in one layer to everything in the prior layer.
Computing the gradient and the inputs to the next layer, however, take
a long time because we have to apply each kernel to the entire image.
Visualize the kernels
Much as we did with the weights in the dense neural networks, we can
visualize the kernels learned by the neural network. To start, make
sure that the sizes of the weights in the convolutional layer
The 16 in each refers to the specific kernel, with all 16 put into
a single 4-dimensional array.
There is not much we can directly do with these, but it is good to
see them in order to check whether you understand what these
convolutions are doing to the input image.
Le-Net for MNIST-10 (1995)
Yann LeCun, one of the creators of the MNIST-10 dataset, was a
pioneer of using CNNs. His 1995 paper provided one of the first
examples where CNNs produced state-of-the-art results for image
LeCun, Yann, et al. “Learning algorithms for classification: A comparison on handwritten digit recognition.” Neural networks: the statistical mechanics perspective 261 (1995): 276.
I have tried to reproduce his exact model and training technique
here in keras. As techniques were not standardized at the time, this
only approximate, but gets to the general point of how powerful these
To fit the model, we do SGD with a large learning rate that is
manually decreased after a handful of epochs. Remember, in 1995
computers were not nearly as powerful as what they are today,
so it was feasible to only run a limited number of epochs even
on a large research project and with the resources at AT&T Bell
The prediction here is fairly good, and better than the dense
networks I had last time.
This is not quite as good as the 99% accuracy reported in the paper.
Likely this is due to using both the training and validation sets
as well as some tweaks Yann LeCun and company used that
was not fully documented in their paper. This is a common but
frustrating patter than occurs often in neural network literature.
The EMNIST dataset is a relatively new dataset constructed by NIST.
It matches the data from the handwritten digits in size, but uses
letters rather than numbers. For size reasons, here we will look at
just the first ten letters of the alphabet.
Let’s read the data in now:
Notice that the dataset includes both upper and lower case letters,
making the task in theory more difficult than the MNIST data.
Here is a much larger CNN model for use with the EMNIST dataset.
It follows some typical design patterns, namely:
uses 2-by-2 kernels with a power of 2 number of filters
uses ReLU activations throughout
uses “double convolutions”; that is, one convolution followed
by another without a pooling layer in between
uses a 2-by-2 max pooling layer and dropout after the
top layers use power of 2 number of nodes, with a uniform
number in each layer
drop out at the last layer, along with a softmax activation
padding uses the “same” logic, to make max pooling even (i.e.,
it keep dimensions factorizable by a power of 2)
And here is the actual model:
Given the added difficulty of this task, our result of 96% is
not a bad start.
For the next lab you’ll be working with the EMNIST dataset, but
using a different subset of the letters.
Negative examples, again
Which letters are the hardest to distinguish?
Looks like the hardest to classify are “a” and “g”,
followed by “g” and “b”; as well as “ “h” and “a”
(likely all lower case versions).
The blue letter gives the initial value and red the predicted value. It
is generally a good sign that many of these are hard for us to max
out as well. Some of them, in fact, I would have guess the same as the
neural network even though in theory they were not the “correct” labels.