Thank you all for the really great feedback on the course.
It seems that things are generally going well. A few themes:

- help with how to select tuning parameters, and more on what they are doing
- more references to other material
- posting more student solutions
- perhaps some more interactive class time
- labs first works well
- enjoying the datasets
- pace seems about right
For the first note, I’ll reiterate that I try to explain roughly
what each parameter is directly doing to the model. Why we
might change it, and what the best value should be, is a harder
problem. Generally I use the validation set to tell me good
values. Hopefully, over time, you will gain some intuition for
what values work well from seeing lots of problems.
For the second, I should do a better job with this. Here are
a few references that are really great for understanding the
main packages we have been using:
Both are large PDFs. The second one is much more theoretical
than the first.
Once we get to neural networks for text and image processing
I will point you all to many more papers and references as
these two fields have changed rapidly over the past 5 years
and it is hard for textbooks to keep up.
If there is anything I can do to help you learn the material
for class, please let me know!
Amazon Classification – Take Two
Let’s look at the Amazon classification data again:
For the lab, you all should have created features that you thought might
be useful to the classification task. Can we do this in a more systematic way?
The basic approach is to create features for all (or the most common)
words that occur in a collection of texts and then use a technique such
as the elastic net that can be run with a large number of variables.
The first step is to use the tokenize_words function from the
tokenizers package to break each text into words:
We then use the term_list_to_df function from smodels to
convert this into a feature data frame with one token per row:
And, finally, we can get a matrix giving counts for each of these
words with the smodels function term_df_to_matrix. In theory, the
result has one row per text and one column for each unique word.
However, by default, only the most frequent 10,000 terms are used (usually
the rare terms that get dropped are proper nouns or misspellings that occur
in only 1 or 2 texts).
Here are the first few rows and columns. The columns are ordered by frequency.
So, for example, the 10th text used the word “the” 8 times. Notice that
this function creates a sparse matrix, in which zero entries are not
explicitly shown. This is very useful here because over 99% of the entries
in the matrix X are zero:
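Put together, the steps above look roughly like this. This is a sketch, not the exact code: I tokenize into words to match the term counts shown below, and the name of the data frame (amazon, with a text column) is an assumption.

```r
library(tokenizers)
library(smodels)

# break each text into word tokens (one vector of tokens per text)
token_list <- tokenize_words(amazon$text)

# one row per token, keyed by the document it came from
token_df <- term_list_to_df(token_list)

# document-term matrix of counts; only the most frequent terms are kept
X <- term_df_to_matrix(token_df)

dim(X)        # one row per text, one column per retained term
mean(X == 0)  # proportion of zero entries (over 99% here)
```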
Using this matrix as a design matrix, we can then create our usual
training and validation sets:
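The split itself can be sketched as follows, assuming a train_id column marking each row as “train” or “valid”, as in our earlier labs:

```r
# the train/valid split, assuming a train_id column as in earlier labs
X_train <- X[amazon$train_id == "train", ]
X_valid <- X[amazon$train_id == "valid", ]
y_train <- amazon$class[amazon$train_id == "train"]
y_valid <- amazon$class[amazon$train_id == "valid"]
```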
As mentioned above, the elastic net is a fantastic model for this task:
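A minimal fit with glmnet’s cross-validated elastic net might look like this; the family and alpha values are assumptions (use family = "binomial" if the response has only two categories), and X_train/y_train are the training objects from the split:

```r
library(glmnet)

# cross-validated elastic net; alpha = 0.9 is one reasonable choice
model <- cv.glmnet(X_train, y_train, family = "multinomial", alpha = 0.9)

plot(model)  # cross-validation curve over the lambda path
```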
Neat, right? What are some interesting patterns you see in the output?
Let’s see how this model stacks up against the predictions from today.
As you all wait to hand in the assignment at the last possible minute, I
don’t know yet, but hopefully fairly well.
Our text prediction task for today involves something called authorship
detection. Given a short snippet of text, predict who wrote or said it.
The data for today comes from State of the Union Addresses. Each speech
was broken up into small snippets; we want to detect which president
is associated with each bit of text.
Here, I gave a class name in addition to a numeric value to make it
easy to look at the results. The data comes from our past three presidents:
For example, here is a bit of text from George W. Bush:
For fairness, the train/valid/test split was done by year. That is,
every snippet in a particular year’s speech was in exactly one of
the three groups. This prevents us from learning features that will
not be useful outside of the corpus (such as particular names of
people or specific issues that were relevant only at one moment in
time).
Let’s build another feature matrix using the words in this corpus.
I’ll put all the steps in one block to make it easier for you to
copy and adapt in your own work.
I have added the options min_df and max_df to filter to only terms
that are used in at least 1% of the documents but no more than 90% of
the documents. These filters help make the computations much faster.
I also scaled the rows of the data; this makes the counts frequencies
rather than raw occurrences. Generally, I play around with whether or
not that improves my model.
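All of the steps in one block might look like the following. The argument names min_df, max_df, and scale are taken from the description above; treat them, and the president data frame, as assumptions about the exact interface:

```r
library(tokenizers)
library(smodels)

token_list <- tokenize_words(president$text)
token_df   <- term_list_to_df(token_list)

# keep terms in at least 1% but no more than 90% of documents,
# and scale each row so counts become frequencies
X <- term_df_to_matrix(token_df, min_df = 0.01, max_df = 0.90,
                       scale = TRUE)
```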
We’ll use the elastic net again:
I toyed with the choice of lambda to find a good set that showed what
features the model is picking up the most:
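One way to inspect those features is to pull out the non-zero coefficients at a chosen value of lambda. This sketch assumes the fitted cv.glmnet object is called model and that the fit is multinomial (one coefficient matrix per president):

```r
# non-zero coefficients at the chosen value of lambda
beta <- coef(model, s = model$lambda.1se)

# for a multinomial fit, coef() returns one sparse matrix per class
lapply(beta, function(b) b[b[, 1] != 0, , drop = FALSE])
```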
At least some of these should seem unsurprising.
The classification rate is not bad either, though there is clearly
overfitting. This is likely due to using cross-validation over a
non-random train/validation split.
One major change we can make to the above modelling approach is to
split the text into something other than words. One common alternative
is to tokenize into ngrams; these are groups of n consecutive words
rather than individual ones. Using two words together is called a
bigram, three a trigram, and so forth. We can access these using
the tokenize_ngrams function. By setting n_min equal to 1, I make
sure to also get the single words (or unigrams):
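Only the tokenization step changes; the rest of the pipeline is as before (the president data frame is again an assumption):

```r
library(tokenizers)

# n = 2 with n_min = 1 gives both bigrams and unigrams
token_list <- tokenize_ngrams(president$text, n = 2, n_min = 1)

head(token_list[[1]])  # first few tokens of the first snippet
```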
Because of the aggressive filtering rules, the number of included
bigrams is not too much larger than in the original data matrix. Here
are some of the bigrams that were included:
We can use this new data matrix as before by passing it to the glmnet
function:
Here are the terms picked out by the elastic net model:
There are certainly some interesting bigrams here, such as “saddam hussein”
and “social security”. The first one is mostly useful in
describing to us what the model has found; the second probably helps to
distinguish the separate meanings of “social”, “security”, and “social security”.
While the training set fits better, we’ve mostly just helped to
overfit the original data. In many cases, however, bigrams and
trigrams are quite powerful. We’ll see over the next few classes
exactly when and where each of these is most useful.
Another way to split apart the text is to break it into individual
characters. This would have helped to find the exclamation mark and
pound symbol in the spam example, for instance. Individual characters
can be put together in much the same way as bigrams and trigrams.
These are often called character shingles. We can get them in R
by using the tokenize_character_shingles function. Here, we’ll get
all shingles from 1 to 3 characters wide. I like to include the
non-alphanumeric characters (i.e., those other than numbers and
letters), but feel free to experiment with excluding them.
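Here is what that tokenization might look like; strip_non_alphanum is the tokenizers argument controlling whether punctuation and spaces are dropped, so setting it to FALSE keeps them:

```r
library(tokenizers)

# all character shingles of width 1 to 3; strip_non_alphanum = FALSE
# keeps punctuation and other non-letter characters
token_list <- tokenize_character_shingles(president$text, n = 3, n_min = 1,
                                          strip_non_alphanum = FALSE)

head(token_list[[1]], 10)
```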
The number of columns is almost twice as large as the bigrams model.
We can plug it into the elastic net once more:
The coefficients are not quite as interesting here, however, as it is
hard to figure out exactly what each feature is picking up:
Predicting on the data, we see that the model performs very well
even though we might not understand it. It is not quite as good
as the bigram model, however.
In the lab for next class you’ll be given a version of this
dataset that contains 5 (different) presidents. Try experimenting
with these automatic functions for extracting features and creating
data matrices from text.