Our initial approach to working with textual data involved counting
the occurrence of words, characters, or sequences of either. These
were summarized in a term frequency matrix, which we typically then
fed into an elastic net given the high-dimensionality of the dataset.
What subtleties are lost here?
word placement in the text
We briefly tried to address some of these by using a text parser, but
these these amounted to simply creating more subtle versions of counting
the occurrence of things. We have not yet seen a real way of working
with the true order of the text.
How can we use a neural network to solve this problem? The output structure
is easy, and matches the one-hot encoding of the image processing tasks.
(from the idea that just one of the bytes is `hot’, i.e. turned on.)
What about the input layer though? How do we feed a block of text into a
Let’s first simplify the problem and think about just a single word at a
time. How can we represent even a single word as an input to a neural
network? One approach is to determine a vocabulary of terms, these
are all of the words that we want to support in the classification task.
Usually we construct this by looking at all of the snippets of text and
taking the N-most commonly occurring words. Any words in the texts not
in this vocabulary are removed before further training. This is the same
thing we did with term frequency matrices, but now we are considering
just a single word.
Once we have this vocabulary, we can represent a single word by
an $N$-length binary vector with exactly one $1$:
This is just another one-hot representation.
Suppose we use a one-hot representation as the first layer in a neural
network. If this is followed directly by a dense layer with p hidden
neurons, the weights in the layer can be defined as an N-by-p matrix
W. In this special case we do not need a bias term, because we already
know the scale of the previous layer (0’s and 1’s).
For a given set of weights W, because of the one-hot representation, the
values of the outputs from the first hidden layer will simply be row j
of the matrix W, where j is the index of the input word in the vocabulary.
A word embedding is nothing more than a compression of a one-hot
representation and a dense hidden layer in a neural network. There
is no need to actually create the one-hot vector, and multiply by
all of W. We can just go directly from the index of the word in
the vocabulary, and read off of the j-th row of W.
What are we doing here, really? The whole idea is to map a word as
a vector in a p-dimensional space:
This is very similar to the transfer learning we did with images,
where each image was projected into a 512-dimensional space.
This is great, but most of the time we want to work with
a collection of words (a document) all as one entity. A
simple way to do this is to apply the word embedding to
each term, and the collapse (flatten) these into a single
So if we have T terms in a document and a word embedding
with p terms, the output from the embedding layer will be
of size T times p. To be clear, the embedding step is agnostic
to the position of the word, much like the shared weights in a
convolutional neural network. The word “apple” is matched to
the same vector regardless of where in the sentence it is
Let’s assume that we have the following word embeddings:
In this example we have a three-dimensional embedding; the first
component indicates whether this a common function word (article,
preposition, conj, ect.), the second whether its an animal, and the
third whether it is an article of clothing. We would map the
“The cat is in the hat and sweater”
To the array:
We can flatten this to an input dataset with 24 (8*3) inputs,
much as we did with the image data, or apply convultions that
respect the dimensionality of the input.
Structure for today
We will persist in one simplification today: assuming that
each document has the same number of terms. This is done by
truncating at some small number of words (less than what
most of the movie reviews are) and filling in any trailing
space by a special embedding of zeros. Though as you will see,
this is easy to rectify for working with text of larger sizes.
Predicting letter bigrams
As a starting example, let’s look again at the Amazon dataset
predicting which category a review comes from. This time, however,
we will not be predicting the class of the item. We are just
using the text data as a good source of natural English language
Instead of embedding entire words, we will initially consider
embedding character bigrams:
Our prediction task will be to look at a window of 9 bigrams
(18 letters) within the text, using the first 4 bigrams and
last 4 bigrams to predict the middle two bigrams. For example,
if we consider the phrase:
I was quite sleepy
We take the lower case version (removing punctuation marks)
and chop it up into bigrams:
[I ][wa][s ][qu][it][e ][sl][ee][py]
The goal is to see the following:
[I ][wa][s ][qu][??][e ][sl][ee][py]
And use the the context to predict that the missing piece is
the bigram “it”.
As a first step we need to construct the dataset. We remove all
the non-letters and spaces, convert to lower case, and tokenize
by characters. We then paste together pairs from the first 20
letters to get the required bigrams (we only need the first 9
bigrams, but I grabbed more to illustrate the point).
Next we create a vector char_vals listing all of the possible
bigrams occurring in the dataset, and convert the bigrams into
integer codes mapping into these values.
Now, the data matrix X consists of columns 1 through 4
and 6 through 9. The response y uses the 5th column, offset
by 1 to make the category ids match the zero-index used by
keras. We don’t have a training and testing set here, so
I will construct one manually.
Now, process the training data for the neural network.
We proceed by constructing a layer_embedding in keras. We
supply the length of the vocabulary (input_dim), the size
of the embedding (output_dim) and the the number of terms
in the input (the columns of X, input_length). The output
will not be a matrix, so we flatten the embedding, apply a dense
layer, and the softmax the expected probabilities.
Can you figure out where the number of parameters in the
embedding layer comes from?
From here, the compiling and training the model uses the
exact some code as with dense and convolutional neural networks.
The validation error rate is around 40%, not bad considering that
this is classification with a categorical variable having 100s of
We can put the predictions together to see what the neural network
predicts for some example bigrams:
Perhaps more impressive than the correct response are the type of
human-like errors it makes.
The embedding layer consists of a 510-by-50 matrix. This gives the
projection of each bigram in 50-dimensional space. We can access
this projection with the function get_layer.
Visualizing the embedding is hard because it exists in 50-dimensional
space. We can take the embedding and apply t-SNE to do dimensionality
I did not see any particularly noticeable patterns, but we will see
that the word variant of this often shows off understandable patterns
in the data.
Now, let’s use actual work embeddings to classify Amazon products.
As with the bigram embeddings, we first need to figure out the vocabulary
of terms we will allow in the model. Unlike with bigrams, there are too
many unique words to include them all. Instead, I will use just the most
frequently used 5000 words (it will actually be slightly more than this
because top_n, when faced with ties, will create a slightly larger
set rather than randomly selecting just 5000).
With this vocabulary, we next create numeric indicies, throwing
away terms not in our vocabulary.
Of course, not every review will contain the same number of words.
To deal with this we pick a particular reference length, truncating
longer sentences and padding shorter sentences with zeros. This is
all handled by the keras function pad_sequences (options
exist for whether truncation and/or padding is done as a prefix
or a suffix; the defaults are usually best unless you have a particular
reason for preferring an alternative).
The zero index is treated specially by the embedding, and always mapped
to the zero vector. Now we process the data as usual:
And construct a neural network with an embedding layer, a flatten
layer, and then dense and output layers.
Finally, we compile the model and fit it to the data
The model performs reasonably well, though not yet a noticeable
improvement over the elastic net models we built.
The problem is that while in the elastic net models we entirely
ignore word ordering, here we go all the way to the other extreme.
The model treats the sentence “I thought this movie was really good”
entirely different than “This movie was really good” because weights
in the dense layer apply specifically to each position of the word.
Using a 1-dimensional variation of convolutions will solve this
It turns out we can think of the output of the word embeddings
as being similar to the multidimensional tensors in image
For example, consider a word embedding with p equal to 3. We can see
this as three parallel 1-dimensional streams of length T,
much in the way that a color image is a 2d-dimensional combination
of parallel red, green, and blue channels.
In this context, we can apply 1-D constitutional layers
just as before: shared weights over some small kernel. Now, however,
the kernel has just a single spatial component. We also apply max
pooling to the sequence, with a window size of 10 (windows in one
dimension tend to be larger than in 2).
Fitting this model on the data shows that it performs better than
the original embedding model (I believe it also bests all of the
models we constructed previously for the Amazon data).
Let’s see if we can do better by including longer sequences
I will replace the max pooling layer with a layer_global_average_pooling_1d.
This does pooling across all the nodes into one global number for each
filter. It is akin to what we did the VGG-16 transfer learning problem
where I took the maximum of each filter over the entire 7-by-7 grid.
Fitting this model shows a further, non-trivial improvement on the
It should make sense that using more data perform better and that
the global position in the text of a sequence of words should not
matter much to the classifier.
Visualize the word embedding
Once again, we can visualize the word embedding from this model.
There are over 5000 words though, making a full plot fairly noisy.
Instead, let’s re-run the elastic net to find the terms the popped
out of the model as being important to the Amazon product classification
and look only at the projection of these.
We’ll grab the embedding layer from model and filter to
just those words in the elastic net model.
Running PCA on the resulting data shows some interesting
Notice that words associated with each category tend to
clump together in the PCA space. This is particularly
evident with the food data.