# Class 16: Natural Language Processing

## Natural Language Processing

Natural language processing, or NLP, refers to the process of teaching machines how to understand language. This goes far beyond the simple models that we have so far built. Consider this classic example sentence:

I forgot the papers.

Only four words here, but let’s think about how we as humans would understand the sentence. The word “I” refers to the speaker. The word “forgot” is a verb that has four important elements to it:

• its lemma, “forget”, has several definitions that all reduce to some form of failing to remember something
• the tense indicates that the action occurred in the past
• the placement right after “I” indicates that the speaker is the one who “forgot”
• the placement of the determiner and noun “the papers” afterwards indicates that this is the object that the speaker forgot

The word “the” is associated with “papers”; it is a definite article and shows that “papers” refers to a specific item rather than a generic object. Finally, “papers” indicates a set of many objects (from the “s”) of type “paper”.

Phew! There is actually a lot going on in this short sentence. And so far we have just looked at the sentence’s semantics, what it is literally saying. Pragmatics goes even further and describes what this sentence means within a broader context. For example, the sentence above could mean several things depending on context:

• I forgot to mix up the morning newspapers.
• I forgot to bring the paperwork for a meeting.

Getting computers to understand all of these nuances is a very open question and certainly not something we will get into the details of right now. However, we can get a little bit closer to this kind of understanding than our current pipeline for processing text allows. Hopefully, at the end of the semester, we will return to these ideas and be able to get just a bit closer to parsing textual data the way that humans do.

## Text annotation

We can split semantics into two broad categories: grammatical relationships and word meanings. The first is the easiest to learn computationally because we have a relatively simple way of describing grammatical relationships. These can further be broken down into several steps:

• tokenization: splitting raw characters into sentences and words
• lemmatization: finding the root of a word, such as the infinitive version of a verb or the singular version of a noun
• part of speech tagging: associating a part of speech to each token
• dependency parsing: finding binary relationships between words, such as noun-verb and subject-verb relations

We have so far only used tokenization in our analyses. To get these deeper attributes we need a more powerful library. I wrote the package cleanNLP for just this purpose. Unlike smodels, it is not just meant for our class; it is meant for public consumption. You can find more details in the paper here:

Setting up the basic R package is easy, but I won’t force you all to do this because you also need to correctly set up the spaCy library in Python, which can be a bit of a pain.
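For reference, the R side really is just a CRAN install; the Python side lives outside of R and the exact commands depend on your system:

```r
# install the R package from CRAN; the spaCy backend additionally needs the
# Python spacy library and an English model installed outside of R
# (for example: pip install spacy, then python -m spacy download en_core_web_sm)
install.packages("cleanNLP")
```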

To run cleanNLP, we can do the following:
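Here is a minimal sketch of what that looks like, assuming a recent version of cleanNLP with the spaCy backend (older versions use slightly different function names and extract the token table differently):

```r
library(cleanNLP)

# initialize the spaCy backend (this is the step that requires the Python setup)
cnlp_init_spacy()

# annotate our example sentence; the result includes a token table with one
# row per token giving the lemma, part of speech codes, and dependency relations
anno <- cnlp_annotate("I forgot the papers.")
anno$token
```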

Notice that this captures all of the grammatical information, but nothing about the meaning of the words themselves. The field upos gives a coarse universal part of speech code, whereas pos gives a finer-grained code specific to English. Specific things captured by the model:

• “I” is a pronoun
• “I” is the subject of the verb “forgot”
• “forgot” is the past tense (“VBD”) of the verb “forget”
• “the” is the determiner of the noun “papers”
• “papers” is the plural form (“NNS”) of “paper”
• “papers” is the direct object of the verb “forgot”

This grammatical information is very useful for many tasks that fall outside of our predictive modelling framework. For example, parsing the sentence is useful for document summarization, classification, information extraction, and for building question-answer type chat bots.

The grammatical units can also be used in the kinds of predictive document classification tasks we have seen so far. We will now briefly look at these.

## Authorship with tokens

Today, we will look at short writings from five well-known British authors.

As a first step, let us build a term-frequency based data matrix X as usual.
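In class this matrix comes from the smodels helpers; as a rough sketch of the same idea in base R (the `stylo` data frame and its `text` column are assumptions about how the data are stored):

```r
# split each text into lowercase word tokens
tokens <- strsplit(tolower(stylo$text), "[^a-z']+")

# keep the most frequent terms as the vocabulary
vocab <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:1000]

# term-frequency matrix X: one row per document, one column per term
X <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
dim(X)
```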

The resulting regression model is quite large. We will look at just the first 50 rows.
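The fitting code itself is not shown in these notes; one reasonable sketch uses a penalized multinomial regression from glmnet (this is an assumption about the model, and `author` and `train` are assumed names for the labels and the training indicator):

```r
library(glmnet)

# penalized multinomial regression of author on the term counts
model <- cv.glmnet(X[train, ], factor(author[train]), family = "multinomial")

# coefficients for the first class at the chosen penalty; first 50 terms only
beta <- coef(model, s = "lambda.1se")
beta[[1]][1:50, , drop = FALSE]
```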

Notice that many of these words are the most frequent terms rather than relatively rare ones. This fits with theories that writing style is largely determined by how often one uses function words.

Evaluating the model, we see that it is reasonably predictive but quite overfit to the data.

We see from the confusion matrix that some authors are harder to tell apart than others.
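A confusion matrix of this sort can be built along the following lines (a sketch; `valid` is an assumed indicator for the validation split):

```r
# predicted author for each validation document, compared against the truth
pred <- predict(model, newx = X[valid, ], type = "class", s = "lambda.1se")
table(truth = author[valid], predicted = as.vector(pred))
```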

Remember, there are $5$ classes, so a rate of 57% on the validation set is actually fairly good. The reason for overfitting is that a given novel or story is completely contained within one of the train/test/valid splits. This is similar to the splitting done with the State of the Union addresses.

## Stylometry

Stylometry is the study of linguistic style. This is more or less what we are attempting to do with this dataset.

Was the classification of the State of the Union Addresses stylometry? Well, sort of. We wanted to predict authorship but this came from both stylistic features as well as topic based features. That will still be somewhat the case with this data (H.G. Wells writes about different topics than Jane Austen), but primarily we hope that the features will indicate more about writing style than just the topics featured.

To get a better sense of style, let us grab the annotations from the cleanNLP package. I have put these online so that we do not need to set up and run the package ourselves.

How might we use these in predicting authorship? One approach is to look at the patterns of part of speech codes. Let’s reconstruct the text using just the universal part of speech values:
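A sketch of that reconstruction, assuming the downloaded annotations include a token table `token` with `doc_id` and `upos` columns (the exact names may differ):

```r
# paste the sequence of universal part of speech codes back together,
# giving one "document" of codes per original text
pos_text <- tapply(token$upos, token$doc_id, paste, collapse = " ")
substr(pos_text[1], 1, 80)
```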

And we do see differences based on the usage of each part of speech.

The overall classification is not great though, at only 35% on the validation set.

We can improve the prediction by considering patterns of parts of speech code. Let’s look at the tri-grams:
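One way to sketch the tri-gram construction, with the same assumed token table (for simplicity this version ignores sentence boundaries):

```r
# overlapping triples of consecutive part of speech codes within each document
pos_trigrams <- tapply(token$upos, token$doc_id, function(u) {
  n <- length(u)
  if (n < 3) return(character(0))
  paste(u[1:(n - 2)], u[2:(n - 1)], u[3:n], sep = "-")
})
head(pos_trigrams[[1]])
```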

The predictive power is substantially improved, though still not quite as good as the approach based on the word counts.

A good technique is to combine the features from this model with the features from the token based model. For example:
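Here is a sketch of the idea, with `X_tokens` and `X_pos` standing in for the word-count and part of speech tri-gram matrices (these names are not from the notes):

```r
# column-bind the two feature sets into one wider matrix and refit the model
X_combined <- cbind(X_tokens, X_pos)
model_combined <- cv.glmnet(X_combined[train, ], factor(author[train]),
                            family = "multinomial")
```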

And we see that the results do improve on those from the words alone, albeit only slightly.

This approach can be extended to fitting models using the pos tags, filtering the word lists based on part of speech codes, using lemmatized word forms, or incorporating the dependency structure into the model. Note that when you are using character shingles, you are partially approximating part of speech codes: many verbs end in “ing ” and many adverbs end in “ly ”. To simplify things for the next lab, you are not allowed to use character shingles and must stick to token based algorithms.

## Visualization and Unsupervised Learning

One difficulty with textual data is that even once we have constructed nice features there is no easy way of plotting the data, because there are far too many variables to draw scatter plots of all of them. Here, we will see some visualization techniques that can be applied to any high dimensional dataset.

### Principal Component Analysis (PCA)

Principal component analysis is a linear technique for dimensionality reduction. For a fixed number of dimensions d, we find the optimal d dimensional hyperplane that describes the largest possible amount of variation in the dataset. This proceeds greedily, as follows:

• the first principal component (PC1) is a weighted sum of the columns of X that has the maximum achievable variance
• the second principal component (PC2) is a weighted sum of the columns of X that has the maximum achievable variance subject to being perpendicular to PC1
• the third principal component (PC3) is a weighted sum of the columns of X that has the maximum achievable variance subject to being perpendicular to PC1 and PC2

And so forth for all p of the principal components. By looking at just the top components we can often visualize the high dimensional data in an interesting way.

Here, I will use the irlba library to efficiently compute the first two principal components for the stylo dataset. We will then plot the results using color to denote the author.
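A sketch of the computation and plot; `prcomp_irlba` is the irlba helper for a truncated PCA, and the plotting details are my own choices:

```r
library(irlba)

# truncated PCA: compute only the first two principal components
pca <- prcomp_irlba(X, n = 2)

# documents plotted in the first two components, colored by author
plot(pca$x[, 1], pca$x[, 2], col = factor(author),
     xlab = "PC1", ylab = "PC2")
```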

Most of the interesting behavior comes from points near the origin so let’s zoom in a bit:

There appears to be a fairly clear difference between the Jane Austen texts (category 1) and the rest. There may be other patterns, but these are slightly harder to find.

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

PCA is a great linear technique for dimensionality reduction. If we want non-linear dimensionality reduction, a popular technique is t-SNE. A fast implementation is available in R’s Rtsne package.
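A sketch of running it (the random seed and the duplicate handling are my own choices):

```r
library(Rtsne)

# non-linear embedding of the documents into two dimensions; duplicate rows
# would otherwise cause Rtsne to stop with an error
set.seed(1)
tsne <- Rtsne(as.matrix(X), dims = 2, check_duplicates = FALSE)

plot(tsne$Y[, 1], tsne$Y[, 2], col = factor(author),
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```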

This does an even better job of spreading the data out over the space and allows us to visualize the differences between the authors.

### The SVD

The singular value decomposition writes a matrix as a product of three new matrices as follows:
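$$ X = U D V^t $$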

The matrices U and V are unitary matrices (you can think of them as representing rotations) and the matrix D is a diagonal matrix. The elements of the diagonal matrix D are all non-negative and arranged from largest to smallest. This gives a way of approximating the matrix by truncating all but the top k values of D to zero. The resulting product then writes X (approximately) as the product of two matrices:

• U: one row per document and k columns
• V: one row per term and k columns

These can be visualized as follows:

The irlba package can compute the SVD of a large matrix (in fact, PCA can be written in terms of the SVD). Make sure that the following dimensions match what you would expect them to be.
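A sketch of the computation (the value of k is arbitrary here):

```r
library(irlba)

# truncated SVD keeping only the top k singular values
k <- 5
svd_res <- irlba(X, nv = k)

dim(svd_res$u)    # one row per document, k columns
dim(svd_res$v)    # one row per term, k columns
length(svd_res$d) # the k singular values
```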

We can view the k dimensions as topics, groups of words that tend to occur together within a document.
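One way to inspect the topics is to list, for each column of V, the terms with the largest weights (a sketch using the `vocab` vector from the term matrix above):

```r
# ten terms with the largest absolute weight in each of the k columns of V
apply(svd_res$v, 2, function(w) vocab[order(abs(w), decreasing = TRUE)[1:10]])
```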

This is an example of a task that would be better if we filtered by parts of speech. So let’s redo it with learned part of speech codes.
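A sketch of one way to do that, restricting to nouns before rebuilding the matrix (the choice of nouns and the token table columns are assumptions):

```r
# keep only the noun tokens, rebuild a lemma-count matrix, and recompute the SVD
nouns    <- token[token$upos == "NOUN", ]
tokens_n <- split(nouns$lemma, nouns$doc_id)
vocab_n  <- names(sort(table(unlist(tokens_n)), decreasing = TRUE))[1:1000]
X_n      <- t(sapply(tokens_n, function(tok) table(factor(tok, levels = vocab_n))))
svd_n    <- irlba(X_n, nv = k)
```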

Here we can see which topic is most associated with each document:
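As a sketch, take for each document the topic with the largest absolute weight in the corresponding row of U (this assumes every document kept at least one noun, so the rows still line up with `author`):

```r
# topic with the largest absolute weight for each document, tabulated by author
top_topic <- apply(abs(svd_n$u), 1, which.max)
table(author, top_topic)
```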

## For next time

We have covered a lot of ground today in terms of new topics; however, unlike other classes, not all of these relate as directly to predictive modelling. That is okay. Hopefully you are here to learn broadly about modelling data in all its forms.

The lab for today uses a collection of blog posts where you need to predict whether the author is a teenager or not. You are not allowed to use character n-grams. Try to focus on not just predicting well but also looking at the output.

Next class we will begin looking at images, starting with the ones you made last week. I’ll also describe details for the final project.