# Class 23: Lions, Tigres, and 狗熊 (oh my)

### 2017-11-16

## Recurrent Neural Networks (RNN)

Recurrent neural networks address a concern with traditional neural networks that becomes apparent when dealing with, amongst other applications, text analysis: the issue of variable width inputs.

This is also, of course, a concern with images but the solution there is quite different because we can stretch and scale images to fit whatever size we need at the moment. This is not so with words. We saw that this can be somewhat dealt with using 1D convolutions, but these do not quite match the way that humans process text.

One way of framing this problem is to think of a string of text as streaming in over time. Regardless of how many words I have seen in a given document, I want to make as good an estimate as possible about whatever outcome is of interest at that moment. This, on the other hand, is fairly consistent with the way the human brain works (at least at a high level).

Using this idea, we can think of variable width inputs such that each new word simply updates our current prediction. In this way an RNN has two types of data inside of it:

- fixed weights, just as we have been using with CNNs
- stateful variables that are updated as it observes words in a document

We can also think of this as giving “memory” to the neural network.

A third way of thinking about recurrent neural networks is to think of a network that has a loop in it. However, the self-input gets applied the *next* time it is called.

A fourth way of thinking about a recurrent neural network is mathematically. We now have two parts to the update function in the RNN:
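One common way to write the two parts, a term driven by the new input and a term driven by the previous state, is sketched here. Apart from $U$, the symbols are assumptions made for illustration: $x_t$ is the input at step $t$, $h_t$ is the state, $W$ and $b$ are fixed weights, and $\sigma$ is an activation such as tanh.

$$
h_t = \sigma\left(\underbrace{W x_t + b}_{\text{new input}} + \underbrace{U h_{t-1}}_{\text{previous state}}\right)
$$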

Notice that $U$ must always be a square matrix, because we could unravel this one time further to yield:
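Assuming an update of the form $h_t = \sigma(W x_t + U h_{t-1} + b)$ (only $U$ is named in these notes; $W$, $b$, $\sigma$, and $h_t$ are illustrative notation), substituting the same rule for $h_{t-1}$ gives:

$$
h_t = \sigma\left(W x_t + U\,\sigma\left(W x_{t-1} + U h_{t-2} + b\right) + b\right)
$$

The matrix $U$ takes a state vector as its input and produces something that is added into the next state vector, so both of its dimensions equal the size of the state.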

One confusing bit, at least for me the first time I saw RNNs, is the relationship between time and samples. We typically restart the state, or memory, of the RNN when we move on to a new sample. This detail seems to be glossed over in most tutorials on RNNs, but I think it clarifies a key idea in what these models are capturing.
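The relationship between time steps and samples can be sketched in base R. This is only a toy illustration with made-up sizes and random weights, not the model used in these notes; the names `rnn_forward`, `W`, `U`, and `b` are assumptions.

```r
# Minimal simple-RNN forward pass in base R (illustrative sizes, random weights).
set.seed(1)
n_in <- 3
n_hidden <- 4

W <- matrix(rnorm(n_hidden * n_in), n_hidden, n_in)          # input weights (fixed)
U <- matrix(rnorm(n_hidden * n_hidden), n_hidden, n_hidden)  # recurrent weights (fixed, square)
b <- rep(0, n_hidden)                                        # bias (fixed)

# Run one sample (a matrix with one row per time step) through the RNN.
rnn_forward <- function(x) {
  h <- rep(0, n_hidden)                # the state is reset for every new sample
  out <- matrix(0, nrow(x), n_hidden)  # running output, one row per time step
  for (t in seq_len(nrow(x))) {
    h <- as.vector(tanh(W %*% x[t, ] + U %*% h + b))
    out[t, ] <- h
  }
  out
}

# Two samples of different lengths share the weights but not the state.
samples <- list(matrix(rnorm(5 * n_in), 5, n_in),
                matrix(rnorm(2 * n_in), 2, n_in))
outputs <- lapply(samples, rnn_forward)
sapply(outputs, nrow)
```

Because `h` is re-initialised inside `rnn_forward`, the output for the second sample does not depend on what the first sample contained.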

In truth, an RNN can be seen as a traditional feedforward neural network by unrolling the time component (assuming that there is a fixed number of time steps).

While it is nice that we get a “running output” from the model, when we train RNNs we typically ignore all but the final output of the model. Getting the right answer after we have looked at the entire document is the end goal, anyway. To do this, backpropagation can be used as before.

While we could unroll the RNN into a feedforward network and apply the algorithms for dense networks, techniques exist to short-cut this approach, both to reduce memory consumption and to improve computational efficiency.

## RNNs for product detection

In order to make direct comparisons with the CNN approach to text analysis, let’s use the Amazon product classification dataset one more time. The input dataset is exactly the same as before.

And then construct the training data.

To construct an RNN layer, we use `layer_simple_rnn`. There are separate dropout rates for the inputs and for the recurrent state, so we specify these directly in the RNN layer. Notice that with `return_sequences` equal to `FALSE`, the RNN converts the input tensors into a 2-dimensional dataset. Therefore we do not need to include a flattening layer.
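A minimal sketch of such a model is given here, assuming the **keras** R package, a 5000-term vocabulary (plus one index for padding), 100-word sequences, and three product classes. The layer sizes and dropout rates are illustrative choices, not necessarily those used for the results in these notes.

```r
library(keras)

# Assumed sizes: 5000 terms + 1 padding index, 100-word sequences, 3 classes.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 5001, output_dim = 50, input_length = 100) %>%
  layer_simple_rnn(units = 32,
                   dropout = 0.25,            # dropout on the inputs to the unit
                   recurrent_dropout = 0.25,  # dropout on the recurrent state
                   return_sequences = FALSE) %>%
  layer_dense(units = 3, activation = "softmax")

model
```

With `return_sequences = FALSE` the dense layer receives one 32-dimensional vector per document, which is why no flattening layer appears.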

Compiling the model and running the algorithm can be done the same way as for the dense and CNN models. The `adam` optimizer is, however, particularly good at finding a good learning rate for recurrent neural networks.

The model performs reasonably well, though not as accurately as the CNNs from last time.

For illustration, let’s look once again at the word embeddings implied by this model. We will find the variables selected from the `glmnet` function and then use t-SNE to plot the variables.

Overall, it behaves similarly to the embedding we saw with the CNN models. This makes sense as the embedding part should not depend in a particular way on whether we use RNNs or CNNs.

## Long short-term memory (LSTM)

Because of the state in the model, words that occur early in the sequence can still have an influence on later outputs.

Using a basic dense layer as the RNN unit, however, makes it so that long range effects are hard to pass on.

Long short-term memory was originally proposed way back in 1997 in order to alleviate this problem.

- Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” *Neural Computation* 9, no. 8 (1997): 1735-1780.

Their specific idea has had surprising staying power. A great reference for dissecting the details of their paper is the blog post by Christopher Olah:

I will pull extensively from it throughout the remainder of today’s notes.

Some people consider LSTMs to be a bit hard to understand; here is a diagram from the original paper that partially explains where the confusion comes from!

In fact, though, the basic idea of an LSTM layer is exactly the same as a simple RNN layer.

It is just that the internal mechanism is a bit more complex, with two separate self-loops and several independent weight functions to serve slightly different purposes.

The diagrams use a few simple mechanics, most of which we have seen in some form in CNNs. The pointwise operation, for example, is used in the ResNet architecture when creating skip-connections.

A key idea is to separate the response that is passed back into the LSTM and the output that is emitted; there is no particular reason these need to be the same. The **cell state** is the part of the layer that gets passed back, and is changed from iteration to iteration only by two linear functions.

Next, consider the **forget gate**. It uses the previous output $h_{t-1}$ and the current input $x_t$ to determine multiplicative weights to apply to the cell state. We use a sigmoid layer here because it makes sense to have weights between $0$ and $1$.
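In the notation of Olah's post (assumed here, since these notes name only $h_{t-1}$ and $x_t$), with $[h_{t-1}, x_t]$ the concatenation of the two vectors, $\sigma$ the sigmoid function, and $W_f$, $b_f$ the forget-gate weights, this can be written as:

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$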

Next, we have a choice of how to update the cell state. This is done by multiplying an input gate (again, with a sigmoid layer) by a tanh activated linear layer.
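Following the notation of Olah's post (an assumption; the symbols are not defined in these notes), the input gate $i_t$ and the tanh-activated candidate values $\tilde{C}_t$ are each computed from the concatenation $[h_{t-1}, x_t]$ with their own weights:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
$$

The proposed update to the cell state is then the elementwise product $i_t \odot \tilde{C}_t$.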

The cell state of the next iteration is now completely determined, and can be calculated directly.
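Writing $f_t$ for the forget gate, $i_t$ for the input gate, $\tilde{C}_t$ for the tanh-activated candidate values, and $\odot$ for elementwise multiplication (notation borrowed from Olah's post and assumed here), the calculation is:

$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$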

Now, to determine the output of the model, we want to emit a weighted version of the cell state. This is done by applying a tanh activation and multiplying by the fourth and final set of weights: the output weights. This is passed both as an output of the LSTM layer and into the next time step of the LSTM.
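Putting the four sets of weights together, a single LSTM time step can be sketched in base R. This is a toy illustration with random weights and made-up sizes that follows the standard formulation in Olah's post; the names `lstm_step`, `new_weights`, and `affine` are hypothetical helpers, not part of any package.

```r
# One LSTM time step in base R (illustrative sizes, random weights).
set.seed(1)
n_in <- 3
n_hidden <- 4
sigmoid <- function(z) 1 / (1 + exp(-z))

# Each of the four weight sets acts on the concatenation c(h_prev, x).
new_weights <- function() {
  list(W = matrix(rnorm(n_hidden * (n_hidden + n_in)), n_hidden),
       b = rep(0, n_hidden))
}
w <- list(forget = new_weights(), input = new_weights(),
          candidate = new_weights(), output = new_weights())

affine <- function(p, z) as.vector(p$W %*% z + p$b)

lstm_step <- function(x, h_prev, c_prev) {
  z <- c(h_prev, x)
  f <- sigmoid(affine(w$forget, z))        # forget gate, values in (0, 1)
  i <- sigmoid(affine(w$input, z))         # input gate
  c_tilde <- tanh(affine(w$candidate, z))  # candidate update
  c_new <- f * c_prev + i * c_tilde        # new cell state
  o <- sigmoid(affine(w$output, z))        # output gate
  h_new <- o * tanh(c_new)                 # emitted output, also fed to the next step
  list(h = h_new, c = c_new)
}

# Stream five random inputs through the unit, starting from a zero state.
state <- list(h = rep(0, n_hidden), c = rep(0, n_hidden))
for (t in 1:5) state <- lstm_step(rnorm(n_in), state$h, state$c)
state
```

Note that the cell state `c` is only ever rescaled and added to, while the output `h` passes through an extra tanh and gate before being emitted.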

Over the years, variants on the LSTM layers have been given.
Confusingly, these are often presented **as** LSTM layers
rather than minor variants on the original technique. One
modification is to add *peepholes* so that the input,
forget, and output gates also take the current cell state
into account.

One natural extension is to set the input and forget gates to be complements of one another, so that the two always sum to one.
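In the notation of Olah's post (assumed here), where $f_t$ is the forget gate, $\tilde{C}_t$ the candidate values, and $\odot$ elementwise multiplication, coupling the gates replaces the separate input gate by $1 - f_t$:

$$
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t
$$

The layer then only adds new information to the parts of the cell state it is forgetting.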

A more dramatically different alternative is known as a Gated Recurrent Unit (GRU), originally presented in this paper:

- Cho, Kyunghyun, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. “On the properties of neural machine translation: Encoder-decoder approaches.” *arXiv preprint* arXiv:1409.1259 (2014).

One benefit is that it offers a slight simplification in the model with no systematic performance penalty. Along with LSTM, it is the only other model implemented in keras, which should point to its growing popularity.

In short, it merges the cell state with the output that is passed back in, and combines the forget and input gates into a single gate. This results in one fewer set of weight matrices to learn.
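A sketch of the GRU update, following the presentation in Olah's post (bias terms omitted; the symbols are assumed notation rather than taken from these notes), uses an update gate $z_t$, a reset gate $r_t$, and a single state $h_t$:

$$
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

There are three weight matrices here ($W_z$, $W_r$, $W$) compared with the four sets in the LSTM.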

## LSTM and sequence visualization

Let’s reconstruct our Amazon dataset and then apply an LSTM model.

The LSTM layer is called with `layer_lstm` and takes the same default inputs and options as `layer_simple_rnn`.
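As a rough sketch, assuming the **keras** R package and the same illustrative sizes as before (a 5000-term vocabulary plus a padding index, 100-word sequences, three classes), swapping in an LSTM only changes the recurrent layer:

```r
library(keras)

# Assumed sizes: 5000 terms + 1 padding index, 100-word sequences, 3 classes.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 5001, output_dim = 50, input_length = 100) %>%
  layer_lstm(units = 32,
             dropout = 0.25,
             recurrent_dropout = 0.25,
             return_sequences = FALSE) %>%
  layer_dense(units = 3, activation = "softmax")

model
```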

Notice that the model has more parameters because of the need to fit several models within each unit. We compile again with ADAM and fit the model on the Amazon data.
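The difference in parameter counts can be checked with a little arithmetic. The sketch below assumes a 50-dimensional input and 32 units (illustrative sizes only); each set of weights has a matrix for the input, a matrix for the previous output, and a bias.

```r
# Parameters in one recurrent layer: n_sets copies of
# (input weights + recurrent weights + bias).
rnn_params <- function(n_in, units, n_sets = 1) {
  n_sets * units * (n_in + units + 1)
}

simple_rnn <- rnn_params(50, 32)        # one set of weights
lstm <- rnn_params(50, 32, n_sets = 4)  # forget, input, candidate, output
c(simple_rnn = simple_rnn, lstm = lstm)
```

Under these assumptions the LSTM layer has exactly four times as many parameters as the simple RNN layer.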

We are now getting results that are approximately as good as our best CNN model.

## Outputting sequences

So far we have set the `return_sequences` option of the RNN layers to `FALSE`. This means that we only get a response for the very last word in the sequence. A benefit of RNNs is the ability to get predictions as words stream in, so we want to rectify this.

As far as I can tell, the only way to do this in **keras** is to first construct a new model with the exact same architecture, but with the RNN set to return sequences and all higher layers wrapped in the function `time_distributed`. We then use the method `set_weights` to map the weights from the trained model into this new model.
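A minimal sketch of this two-model approach is given here, assuming the **keras** R package. The helper `build_model` and all sizes are hypothetical, and the first model is left untrained to keep the example short; in practice it would be compiled and fit before its weights are copied.

```r
library(keras)

# Hypothetical helper: the two models differ only in whether the LSTM
# returns the full sequence and whether the dense layer is time-distributed.
build_model <- function(sequences) {
  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = 5001, output_dim = 50, input_length = 100) %>%
    layer_lstm(units = 32, return_sequences = sequences)
  if (sequences) {
    model %>% time_distributed(layer_dense(units = 3, activation = "softmax"))
  } else {
    model %>% layer_dense(units = 3, activation = "softmax")
  }
  model
}

model <- build_model(FALSE)     # the model that would be compiled and trained
model_seq <- build_model(TRUE)  # emits a prediction after every word

# Copy the weights across; the shapes match layer by layer.
model_seq %>% set_weights(get_weights(model))

# Two dummy documents of 100 word indices each.
X_toy <- matrix(1L, nrow = 2, ncol = 100)
dim(predict(model_seq, X_toy))
```

The prediction from `model_seq` at the final time step should agree with the prediction from `model`, since the two share the same weights.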

Notice that we could not have started with this model because we would need an output `Y` with 100 columns (we could replicate the data 100 times, but this would then train the model on whether it can identify the category in the middle of the text, which is not our primary goal).

Predicting results from this model now gives 100 predictions for each input.

We can see that the prediction rate increases as the number of words increases. Much of this is due to shorter sentences that are padded with zeros, but there is also something powerful about having all of the words rather than just a few of them.
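The accuracy at each position in the sequence can be computed along the following lines. The array `pred` here is a random stand-in for the output of `predict` (documents by time steps by classes) and `y` is a made-up vector of true classes, so the numbers themselves are meaningless; only the bookkeeping is of interest.

```r
# Toy stand-in for predict() output: n documents x 100 time steps x 3 classes.
set.seed(1)
n <- 50
n_steps <- 100
n_class <- 3
pred <- array(runif(n * n_steps * n_class), dim = c(n, n_steps, n_class))
y <- sample(seq_len(n_class), n, replace = TRUE)  # true class of each document

# Most probable class after each word, then accuracy at each time step.
pred_class <- apply(pred, c(1, 2), which.max)     # n x 100 matrix
acc_by_step <- colMeans(pred_class == y)          # y is recycled down each column
length(acc_by_step)
```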

It will be informative to see exactly how these predictions match up to the input text. We will create a matrix `X_words` containing the actual input words and `pred_class` of the predicted classes.

Our function `see_text` plots the predicted probabilities along with the raw input.

Here are some examples:

Particularly interesting are those reviews that are misclassified. Here are a few of those:

Notice that the probability for one class often shoots up to almost one, but this is not permanent and can often change just based on the addition of one or two key words from the other class of products.

## Transfer learning of embeddings

One of the most powerful features of using neural networks for image processing was the ability to use transfer learning. This is also the case for working with word embeddings.

Unlike the CNN models, there are no pre-trained word embeddings in **keras**. We need to use a separate package to compute these embeddings. One package is my own **fasttextM**. It is particularly nice because it allows for embedding words from a number of languages into a common space. That is, we would expect that:
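For instance, writing $e_{\text{en}}(\cdot)$ and $e_{\text{fr}}(\cdot)$ for the English and French embeddings (the notation and the word pair are assumptions chosen for illustration), a word and its translation should land close together:

$$
\left\lVert e_{\text{en}}(\text{dog}) - e_{\text{fr}}(\text{chien}) \right\rVert_2 < \epsilon
$$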

For some relatively small value of epsilon. You can download models using the function `ft_download_model` (this takes a while but needs to be done only once):

Then, load each model you want to work with using `ft_load_model`:

Finally, the function `ft_embed` takes a vector of words and returns a 300 column matrix giving the embedding of each term:

Several functions exist for understanding the structure of the word embedding. The function `ft_nn` gives the nearest neighbor terms for each input:

If another language model is loaded, you can look up the nearest neighbors in the other language:

Here we see that the model generally maps to translations and other similar terms. I know we have a large number of students from China or majoring in Chinese. Perhaps you can help tell me if this makes any sense:

My quick Google search for 狗熊 found this bear, so I assume the translation is not perfect, but it is at least generally mapping animals to animals.

We can now apply this pre-trained word embedding to our Amazon product data. This code embeds all of the available terms that we have, setting missing terms to zero:
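The bookkeeping for this step can be sketched in base R. Here `vocab` and `embed_lookup` are hypothetical toy stand-ins for the pre-trained vectors (which have 300 columns in practice), and `X_words` is a small made-up matrix of padded documents.

```r
# Toy stand-ins for the pre-trained embedding (4 dimensions instead of 300).
set.seed(1)
vocab <- c("book", "movie", "coffee")
n_dim <- 4
embed_lookup <- matrix(rnorm(length(vocab) * n_dim), length(vocab), n_dim,
                       dimnames = list(vocab, NULL))

# Two padded documents of 5 words each; NA marks padding.
X_words <- rbind(c("book", "zzz", "movie", NA, NA),
                 c("coffee", "coffee", "book", "movie", NA))

# Look up each word; rows for missing or unknown terms stay at zero.
idx <- match(X_words, vocab)                  # NA when the word is not in vocab
flat <- matrix(0, length(idx), n_dim)
flat[!is.na(idx), ] <- embed_lookup[idx[!is.na(idx)], ]

# Reshape to documents x positions x embedding dimensions.
X_embed <- array(flat, dim = c(nrow(X_words), ncol(X_words), n_dim))
dim(X_embed)
```

The result has one embedding vector per word position, which is the three-dimensional shape an RNN layer expects as direct input.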

We can now create a neural network that has an LSTM layer but no embedding layer. Instead, we will pass `X_embed` directly as an input.
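A minimal sketch of such a network, assuming the **keras** R package, 100-word sequences, 300-dimensional embeddings, and three classes (the number of units and the dropout rates are illustrative), declares the input shape on the LSTM layer itself:

```r
library(keras)

# Each document arrives as a 100 x 300 matrix of pre-computed embeddings.
model <- keras_model_sequential() %>%
  layer_lstm(units = 32,
             dropout = 0.25,
             recurrent_dropout = 0.25,
             input_shape = c(100, 300)) %>%
  layer_dense(units = 3, activation = "softmax")

model
```

Since the embedding is fixed ahead of time, the only parameters left to learn are those in the LSTM and dense layers.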

Once again, we compile the model, create input data, and train:

The model now has the best prediction rate of all our models so far.

We could probably do better by increasing the number of terms in the sequences and not filtering out words from the top 5000 (I did the latter to simplify the code, but with the fasttext word embedding there is no need).

## Resources

If you would like a good, comprehensive, and empirical evaluation of the various tweaks to these recurrent structures, I recommend this paper:

- Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. “LSTM: A search space odyssey.” *arXiv preprint* arXiv:1503.04069 (2015).

As well as this article:

- Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. “An empirical exploration of recurrent network architectures.” In *Proceedings of the 32nd International Conference on Machine Learning* (ICML-15), pp. 2342-2350. 2015.

Though, once you fully understand the LSTM model, the specifics amongst the competing approaches typically do not require understanding any new big ideas.