Natural language processing, or NLP, refers to the process of
teaching machines how to understand language. This goes
far beyond the simple models that we have so far built. Consider
this classic example sentence:
I forgot the papers.
Only four words here, but let’s think about how we as humans
would understand the sentence. The word “I” refers to the speaker.
The word “forgot” is a verb with four important elements:
its lemma, “forget”, has several definitions that all
reduce to some form of not remembering something or not remembering to do something
the tense indicates that the action occurred in the past
the placement right after “I” indicates that the speaker
is the one who “forgot”
the placement of the determiner and noun “the papers”
afterwards indicates that this is the object that the speaker forgot
The word “the” is associated with “papers”; it is a definite
article and shows that “papers” refers to a specific item
rather than a generic object. Finally, “papers” indicates
a set of several objects (from the “s”) of type “paper”.
Phew! There is actually a lot going on in this short sentence.
And so far we have just looked at the sentence's semantics,
what it literally says. Pragmatics goes even further
and describes what this sentence means within a broader context.
For example, the sentence above could mean several things
depending on context:
I forgot to pick up the morning newspapers.
I forgot about the papers (perhaps that needed grading).
I forgot to bring the paperwork for a meeting.
Getting computers to understand all of these nuances is
very much an open problem and certainly not something we will
get into the details of right now. However, we can get a
little bit closer to understanding than our current pipeline
for processing text allows. Hopefully, at the end of the semester,
we will return to these ideas and be able to get just a
bit closer to parsing textual data the way that humans do.
We can split semantics into two broad categories: grammatical
relationships and word meanings. The first is the easier to
learn computationally because we have a relatively simple way
of describing grammatical relationships. These can further
be broken down into several steps:
tokenization: splitting raw characters into sentences and tokens
(words and punctuation marks)
lemmatization: finding the root of a word, such as the
infinitive version of a verb or the singular version of a noun
part of speech tagging: associating a part of speech
to each token
dependency parsing: finding binary relationships
between words, such as noun-verb and subject-verb relations
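To make the first two steps concrete, here is a minimal Python sketch (not the R code we use in class, and not how cleanNLP or spaCy work internally). The regular expression and the tiny `TOY_LEMMAS` lookup table are assumptions made up for this illustration; real lemmatizers rely on large dictionaries and on the part of speech of each token.

```python
import re

# Toy lemma lookup, invented for this example only.
TOY_LEMMAS = {"forgot": "forget", "papers": "paper"}

def tokenize(text):
    """Split raw text into word and punctuation tokens with a naive regex."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatize(token):
    """Look the token up in the toy table, falling back to the token itself."""
    return TOY_LEMMAS.get(token.lower(), token)

tokens = tokenize("I forgot the papers.")
lemmas = [lemmatize(t) for t in tokens]
print(tokens)  # ['I', 'forgot', 'the', 'papers', '.']
print(lemmas)  # ['I', 'forget', 'the', 'paper', '.']
```

Part of speech tagging and dependency parsing cannot be done with rules this simple, which is why we hand those steps to a trained model.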
We have so far only used tokenization in our analyses. To
get these deeper attributes we need a more powerful library.
I wrote the package cleanNLP for just this purpose.
Unlike smodels, this is not just meant for our class; it
is something meant for public consumption. You can find more
details in the paper here:
Setting up the basic R package is easy, but I won’t force you
all to do this because you also need to correctly set up the
spaCy library in Python which can be a bit of a pain.
To run cleanNLP, we can do the following:
Notice that this captures all of the grammatical information,
but nothing about the meaning of the words themselves. The
field upos gives a coarse universal part of speech code
whereas pos gives a finer-grained part of speech code specific to
English. Specific things captured by the model:
“I” is a pronoun
“I” is the subject of the verb “forgot”
“forgot” is the past tense (“VBD”) of the verb “forget”
“the” is the determiner of the noun “papers”
“papers” is the plural (“NNS” is plural) of “paper”
“papers” is the direct object of the verb “forgot”
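One way to picture this output is as a table with one row per token. The Python sketch below types such a table by hand from the facts listed above; it is not output produced by cleanNLP, and the column names (`id`, `head`, `relation`, and so on) as well as the helper `dependents` are assumptions chosen for the illustration.

```python
# Hand-typed rows for "I forgot the papers." (illustrative, not package output).
# "head" holds the id of the word each token attaches to; 0 marks the root.
annotation = [
    {"id": 1, "word": "I", "lemma": "I", "upos": "PRON", "pos": "PRP",
     "head": 2, "relation": "nsubj"},
    {"id": 2, "word": "forgot", "lemma": "forget", "upos": "VERB", "pos": "VBD",
     "head": 0, "relation": "root"},
    {"id": 3, "word": "the", "lemma": "the", "upos": "DET", "pos": "DT",
     "head": 4, "relation": "det"},
    {"id": 4, "word": "papers", "lemma": "paper", "upos": "NOUN", "pos": "NNS",
     "head": 2, "relation": "dobj"},
]

def dependents(rows, head_word, relation):
    """Return the words attached to head_word by the given relation."""
    head_ids = {r["id"] for r in rows if r["word"] == head_word}
    return [r["word"] for r in rows
            if r["head"] in head_ids and r["relation"] == relation]

print(dependents(annotation, "forgot", "nsubj"))  # ['I']
print(dependents(annotation, "forgot", "dobj"))   # ['papers']
```

Questions such as "who forgot?" or "what was forgotten?" then become simple look-ups on the table.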
This grammatical information is very useful for many tasks that
fall outside of our predictive modelling framework. For example,
parsing the sentence is useful for document summarization,
classification, information extraction, and for building
question-answer type chat bots.
The grammatical units can also be used in the kinds of predictive
document classification tasks we have seen so far. We will now
briefly look at these.
Authorship with tokens
Today, we will look at short writings from five well-known authors.
As a first step, let us build a term-frequency based
data matrix X as usual.
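As a reminder of what such a matrix contains, here is a small Python sketch on two made-up sentences. It is a stand-in for the R helpers we normally use, not a copy of them; the function name `term_frequency_matrix` and the whitespace tokenizer are assumptions for the example.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Return (vocab, X), where X[i][j] counts term vocab[j] in document i."""
    # Lower-case and split on whitespace; a real pipeline would tokenize properly.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    X = [[c[term] for term in vocab] for c in counts]
    return vocab, X

docs = ["the cat saw the dog", "the dog ran"]
vocab, X = term_frequency_matrix(docs)
print(vocab)  # ['cat', 'dog', 'ran', 'saw', 'the']
print(X)      # [[1, 1, 0, 1, 2], [0, 1, 1, 0, 1]]
```

Each row is a document and each column a term, which is exactly the shape a penalized regression expects.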
The resulting regression model is quite large. We will
look at just the first 50 rows.
Notice that many of these words are the most frequent terms
rather than relatively rare ones. This fits with theories that
writing style is most determined by how often one uses the most common words.
Evaluating the model, we see that it is reasonably predictive
but quite overfit to the data.
We see from the confusion matrix that some authors are
more difficult to tell apart from others.
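For reference, a confusion matrix is just a table of counts of (actual, predicted) pairs. The Python sketch below builds one from scratch; the labels and predictions are invented toy values, not results from our model.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]

def classification_rate(actual, predicted):
    """Share of observations whose prediction matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Toy values made up for the example.
actual = ["Austen", "Austen", "Wells", "Wells", "Wells"]
predicted = ["Austen", "Wells", "Wells", "Wells", "Austen"]

print(confusion_matrix(actual, predicted, ["Austen", "Wells"]))  # [[1, 1], [1, 2]]
print(classification_rate(actual, predicted))                    # 0.6
```

Large off-diagonal counts point to pairs of authors that the model confuses with one another.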
Remember, there are $5$ classes, so random guessing would be
right only about 20% of the time if the classes are balanced, and a rate
of 57% on the validation set is actually fairly good. The reason for
overfitting is that a given novel or story is completely
contained within one of the train/test/valid splits. This
is similar to the splitting done with the State of the Union addresses.
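To illustrate what it means for a whole work to sit inside one split, here is a small Python sketch. It is not how the splits for this dataset were actually made; the function `split_for_work`, the 60% share, and the work identifiers are all hypothetical.

```python
import hashlib

def split_for_work(work_id, train_percent=60):
    """Map a whole work to one split, so its chunks are never divided."""
    # md5 serves only as a stable hash; Python's built-in hash() varies between runs.
    bucket = int(hashlib.md5(work_id.encode("utf-8")).hexdigest(), 16) % 100
    return "train" if bucket < train_percent else "valid"

# Hypothetical chunks: (work identifier, chunk of text).
chunks = [("work_a", "chunk 1"), ("work_a", "chunk 2"), ("work_b", "chunk 1")]
assigned = [(work, text, split_for_work(work)) for work, text in chunks]

for row in assigned:
    print(row)
```

Because every chunk of a work lands in the same split, anything the model learns about one particular story (character names, places) cannot help on the validation set, which is why the training rate looks so much better.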
Stylometry is the study of linguistic style. This is
more or less what we are attempting to do with this dataset.
Was the classification of the State of the Union Addresses
stylometry? Well, sort of. We wanted to predict authorship
but this came from both stylistic features as well as topic
based features. That will still be somewhat the case with
this data (H.G. Wells writes about different topics than
Jane Austen), but primarily we hope that the features will
indicate more about writing style than just the topics
To get a better sense of style, let us grab the annotations
from the cleanNLP package. I have put these online so
that we do not need to set up and run the package ourselves.
How might we use these in predicting authorship? One approach is
to look at the patterns of part of speech codes. Let’s reconstruct
the text using just the universal part of speech values:
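The idea, sketched in Python on a few hand-typed rows (the column names `doc_id` and `upos` and the function `pos_text` are assumptions for the example, and the rows are not taken from our dataset): paste the tags of each document together so that the tags can be treated just like words.

```python
from collections import defaultdict

def pos_text(rows):
    """Collapse token-level rows into one string of tags per document."""
    tags = defaultdict(list)
    for row in rows:
        tags[row["doc_id"]].append(row["upos"])
    return {doc: " ".join(seq) for doc, seq in tags.items()}

# Hand-typed toy annotations.
rows = [
    {"doc_id": "doc1", "upos": "PRON"}, {"doc_id": "doc1", "upos": "VERB"},
    {"doc_id": "doc1", "upos": "DET"}, {"doc_id": "doc1", "upos": "NOUN"},
    {"doc_id": "doc1", "upos": "PUNCT"},
    {"doc_id": "doc2", "upos": "DET"}, {"doc_id": "doc2", "upos": "NOUN"},
    {"doc_id": "doc2", "upos": "VERB"},
]

print(pos_text(rows))
# {'doc1': 'PRON VERB DET NOUN PUNCT', 'doc2': 'DET NOUN VERB'}
```

The resulting strings can be fed to the same term-frequency machinery we already use for ordinary text.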
And we do see differences based on the usage of each part of speech.
The overall classification is not great though, at only 35%
on the validation set.
We can improve the prediction by considering sequences of
part of speech codes. Let’s look at the tri-grams:
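A tri-gram here is simply three consecutive tags treated as one feature. A minimal Python sketch, assuming the tags of a document are already available as a list (the function name `ngrams` and the underscore separator are choices made for the example):

```python
from collections import Counter

def ngrams(tags, n=3):
    """Return every run of n consecutive tags, joined with underscores."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

tags = "PRON VERB DET NOUN PUNCT".split()
print(ngrams(tags))
# ['PRON_VERB_DET', 'VERB_DET_NOUN', 'DET_NOUN_PUNCT']

# Counting the tri-grams gives one row of a term-frequency matrix.
print(Counter(ngrams(tags)))
```

With only a handful of universal tags there are still thousands of possible tri-grams, so these features capture far more about sentence structure than single tags do.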
The predictive power is substantially improved, though
still not quite as good as the approach based on the words.
A good technique is to combine the features from this model
with the features from the token based model. For example:
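Combining features just means placing the two data matrices side by side, so that each document keeps one row but gains extra columns. A small Python sketch with toy numbers (the function name `column_bind` is made up; in R this is what binding columns does):

```python
def column_bind(X_tokens, X_pos):
    """Join two feature matrices with the same rows, column-wise."""
    if len(X_tokens) != len(X_pos):
        raise ValueError("both matrices need one row per document")
    return [a + b for a, b in zip(X_tokens, X_pos)]

# Toy matrices: 2 documents, 3 token features and 2 part of speech features.
X_tokens = [[1, 0, 2], [0, 1, 1]]
X_pos = [[4, 1], [2, 3]]

print(column_bind(X_tokens, X_pos))  # [[1, 0, 2, 4, 1], [0, 1, 1, 2, 3]]
```

The rows must be in the same document order in both matrices, otherwise the combined features describe different texts.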
And we see that the results do improve on those from the
words alone, albeit only slightly.
This approach can be extended to fitting models using the
pos tags, filtering the word lists based on part of speech
codes, using lemmatized word forms, or incorporating the
dependency structure into the model. Note that when you
are using character shingles, you are partially approximating
part of speech codes. Many verb forms end in “ing ” and many adverbs end
in “ly ”. To simplify things for the next lab, you are
not allowed to use character shingles and must stick to
token based algorithms.
Visualization and Unsupervised Learning
One difficulty with textual data is that even once we
have constructed nice features there is no easy way of
plotting the data because there are too many variables
to plot scatter plots with all of them. Here, we will
see some visualization techniques that can be applied
to any high dimensional dataset.
Principal Component Analysis (PCA)
Principal component analysis is a linear technique for
dimensionality reduction. For a fixed number of dimensions
d, we find the optimal d dimensional hyperplane that describes
the largest possible amount of variation in the dataset. This
proceeds greedily, as follows:
the first principal component (PC1) is a weighted sum of
the columns of X that has the maximum achievable variance,
with the weights scaled to have unit length
the second principal component (PC2) is a weighted sum of
the columns of X that has the maximum achievable variance
subject to being perpendicular to PC1
the third principal component (PC3) is a weighted sum of
the columns of X that has the maximum achievable variance
subject to being perpendicular to PC1 and PC2
And so forth for all p of the principal components. By looking
at just the top components we can often visualize the
high dimensional data in an interesting way.
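To see what "the weighted sum with maximum variance" means in practice, here is a pure Python sketch that finds the weights of PC1 by power iteration. This is only an illustration and not the algorithm the irlba library uses; the function names, the number of iterations, and the toy data are all assumptions.

```python
import math
import random

def center_columns(X):
    """Subtract each column's mean, as PCA works on centred data."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def first_component(X, iterations=200, seed=0):
    """Approximate the unit-length PC1 weights of centred X by power iteration."""
    rng = random.Random(seed)
    p = len(X[0])
    w = [rng.uniform(-1, 1) for _ in range(p)]
    for _ in range(iterations):
        scores = [sum(x * wj for x, wj in zip(row, w)) for row in X]  # X w
        w = [sum(row[j] * s for row, s in zip(X, scores)) for j in range(p)]  # X^T (X w)
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]  # keep the weights at unit length
    return w

# Toy data lying exactly on the line y = x.
X = center_columns([[1, 1], [2, 2], [3, 3], [4, 4]])
w = first_component(X)
pc1 = [sum(x * wj for x, wj in zip(row, w)) for row in X]
print(w)    # both weights are about 0.707 in absolute value, with the same sign
print(pc1)  # the coordinate of each point along that direction
```

The sign of the weights is arbitrary, which is why PCA plots sometimes come out mirrored from one run or package to the next.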
Here, I will use the irlba library to efficiently compute
the first two principal components for the stylo dataset.
We will then plot the results using color to denote the author.
Most of the interesting behavior comes from points near
the origin so let’s zoom in a bit:
There appears to be a fairly clear difference between the
Jane Austen texts (category 1) and the rest. There may be
other patterns, but these are slightly harder to find.
PCA is a great linear technique for dimensionality reduction.
If we want a non-linear dimensionality reduction, a
popular technique is t-SNE. A fast implementation is available
with R’s Rtsne package.
This does an even better job of spreading the data out over
the space and allowing us to visualize the changes over time.
The singular value decomposition writes a matrix as a
product of three new matrices as follows:
$$ X = U D V^T $$
The matrices U and V are unitary matrices (you can
think of them as representing rotations) and the matrix D
is a diagonal matrix. The elements of the diagonal matrix
D are all non-negative and arranged from largest to
smallest. This gives a way of approximating the matrix
by setting all but the top k values of D to zero.
The resulting product then approximates X using truncated versions of
U: one row per document and k columns
V: one row per term and k columns
These can be visualized as follows:
The irlba package can compute the SVD of a large
matrix (in fact, PCA can be written in terms of the
SVD). Make sure that the following dimensions match
what you would expect them to be.
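As a reference for that check, here is the rank k approximation with the shapes written out. The symbols n (number of documents) and p (number of terms) are my own labels for the dimensions of X, and the subscript k marks the truncated matrices; none of these are names used by irlba.

$$ \underbrace{X}_{n \times p} \approx \underbrace{U_k}_{n \times k} \; \underbrace{D_k}_{k \times k} \; \underbrace{V_k^T}_{k \times p} $$

So the truncated U should have n rows and k columns, the truncated V should have p rows and k columns, and there should be k singular values on the diagonal of D.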
We can view the k dimensions as topics, groups of words
that tend to occur together within a document.
This is an example of a task that would be better if we
filtered by parts of speech. So let’s redo it with
learned part of speech codes.
Here we can see which topic is most associated with each speech:
For next time
We have covered a lot of ground today in terms of new topics.
However, unlike in other classes, not all of these relate as directly
to predictive modelling. That is okay. Hopefully you are here to
learn broadly about modelling data in all its forms.
The lab for today uses a collection of blog posts where you need
to predict whether the author is a teenager or not. You are not
allowed to use character n-grams. Try to focus on not just predicting
well but also looking at the output.
Next class we will begin looking at images, starting with the ones
you made last week. I’ll also describe details for the final project.