U.S. Authors

In this lab we will look at a corpus of short snippets of novels from three U.S. authors. The data have already been split into training and validation sets. All we need to do is read them in:

set.seed(1)

us <- read_csv(file.path("data", "stylo_us.csv"))
head(us)
## # A tibble: 6 x 4
##   doc_id   train_id author   text                                               
##   <chr>    <chr>    <chr>    <chr>                                              
## 1 id_0000… valid    Hawthor… "At first, his expression had been calm, meditativ…
## 2 id_0000… train    Poe      "Ah, less--less bright The stars of the night Than…
## 3 id_0000… train    Poe      "\"Oh, yes,\" he replied, coloring violently, \"I …
## 4 id_0000… train    Twain    "There was as many as one loafer leaning up agains…
## 5 id_0000… train    Hawthor… "He's the greatest man of this or any other age, b…
## 6 id_0000… train    Twain    "His spirits sank lower and lower as he moved betw…

I have also already run the corpus through the spaCy annotation engine and saved the tokens data set for you to work with in R. You can read it in with the following:

token <- read_csv(file.path("data", "stylo_us_token.csv.gz"))
head(token)
## # A tibble: 6 x 10
##   doc_id    sid   tid token  token_with_ws lemma upos  xpos  tid_source relation
##   <chr>   <dbl> <dbl> <chr>  <chr>         <chr> <chr> <chr>      <dbl> <chr>   
## 1 id_000…     1     1 At     At            at    ADP   IN             7 prep    
## 2 id_000…     1     2 first  first         first ADV   RB             1 pcomp   
## 3 id_000…     1     3 ,      ,             ,     PUNCT ,              7 punct   
## 4 id_000…     1     4 his    his           -PRO… DET   PRP$           5 poss    
## 5 id_000…     1     5 expre… expression    expr… NOUN  NN             7 nsubj   
## 6 id_000…     1     6 had    had           have  AUX   VBD            7 aux
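
For reference, the annotation itself was produced ahead of time. The sketch below shows roughly how a token table like this could be built with the cleanNLP package's spaCy backend; it requires a working Python spaCy installation and is not something you need to run for this lab.

# a sketch only: re-creating a token table with the spaCy backend
# (assumes cleanNLP plus a configured Python spaCy installation)
library(cleanNLP)
cnlp_init_spacy()            # initialize the spaCy backend
anno <- cnlp_annotate(us)    # annotate the 'text' column, keyed by 'doc_id'
token_new <- anno$token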

Understanding the data

Before jumping into any other coding, use the three blocks below to randomly sample (using sample_n) 5 text documents from each of the three authors in the data: “Hawthorne”, “Poe”, “Twain”.

us %>%
  filter(author == "Hawthorne") %>%
  sample_n(size = 5) %>%
  use_series("text")
## [1] "Such voices have put on mourning for dead hopes; and they ought to die and be buried along with them! Discerning that Clifford was not gladdened by her efforts, Hepzibah searched about the house for the means of more exhilarating pastime. At one time, her eyes chanced to rest on Alice Pyncheon's harpsichord. It was a moment of great peril; for,--despite the traditionary awe that had gathered over this instrument of music, and the dirges which spiritual fingers were said to play on it,--the devoted sister had solemn thoughts of thrumming on its chords for Clifford's benefit, and accompanying the performance with her voice. Poor Clifford!"                                                       
## [2] "\"What means the Bedlamite by this freak?\" \"Nay,\" answered Lady Eleanore playfully, but with more scorn than pity in her tone, \"your Excellency shall not strike him. When men seek only to be trampled upon, it were a pity to deny them a favor so easily granted--and so well deserved!\" Then, though as lightly as a sunbeam on a cloud, she placed her foot upon the cowering form, and extended her hand to meet that of the Governor. There was a brief interval, during which Lady Eleanore retained this attitude; and never, surely, was there an apter emblem of aristocracy and hereditary pride trampling on human sympathies and the kindred of nature, than these two figures presented at that moment."
## [3] "A fragrance was diffused from it which Giovanni recognized as identical with that which he had attributed to Beatrice's breath, but incomparably more powerful. As her eyes fell upon it, Giovanni beheld her press her hand to her bosom as if her heart were throbbing suddenly and painfully. \"For the first time in my life,\" murmured she, addressing the shrub, \"I had forgotten thee.\" \"I remember, signora,\" said Giovanni, \"that you once promised to reward me with one of these living gems for the bouquet which I had the happy boldness to fling to your feet. Permit me now to pluck it as a memorial of this interview.\""                                                                           
## [4] "This kind of dreamy feeling always comes over me before any wonderful occurrence. If you take my advice, you will turn back.\" \"No, no,\" answered his comrades, snuffing the air, in which the scent from the palace kitchen was now very perceptible. \"We would not turn back, though we were certain that the king of the Laestrygons, as big as a mountain, would sit at the head of the table, and huge Polyphemus, the one-eyed Cyclops, at its foot.\" At length they came within full sight of the palace, which proved to be very large and lofty, with a great number of airy pinnacles upon its roof."                                                                                                         
## [5] "In his natural system, though high-wrought and delicately refined, a sensibility to the delights of the palate was probably inherent. It would have been kept in check, however, and even converted into an accomplishment, and one of the thousand modes of intellectual culture, had his more ethereal characteristics retained their vigor. But as it existed now, the effect was painful and made Phoebe droop her eyes. In a little while the guest became sensible of the fragrance of the yet untasted coffee. He quaffed it eagerly."
us %>%
  filter(author == "Poe") %>%
  sample_n(size = 5) %>%
  use_series("text")
## [1] "I will tell you. My reward shall be this. You shall give me all the information in your power about these murders in the Rue Morgue.\" Dupin said the last words in a very low tone, and very quietly. Just as quietly, too, he walked toward the door, locked it and put the key in his pocket."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [2] "The one finished by complete failure what he commenced in the grossest misconception; the other, by a path which could not possibly lead him astray, arrived at a triumph which is not the less glorious because hidden from the profane eyes of the multitude. But in this view even the \"metaphysical verse\" of Cowley is but evidence of the simplicity and single-heartedness of the man. And he was in this but a type of his school-for we may as well designate in this way the entire class of writers whose poems are bound up in the volume before us, and throughout all of whom there runs a very perceptible general character. They used little art in composition. Their writings sprang immediately from the soul-and partook intensely of that soul's nature."
## [3] "[ This reply startled me very much.  ]  P.   What then is he?   [ After a long pause, and mutteringly.  ] I see--but it is a thing difficult to tell."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [4] "Had the pirate recovered his money, there the affair would have dropped. It seemed to me that some accident--say the loss of a memorandum indicating its locality--had deprived him of the means of recovering it, and that this accident had become known to his followers, who otherwise might never have heard that treasure had been concealed at all, and who, busying themselves in vain, because unguided attempts, to regain it, had given first birth, and then universal currency, to the reports which are now so common. Have you ever heard of any important treasure being unearthed along the coast?\" \"Never.\" \"But that Kidd's accumulations were immense, is well known."                                                                                   
## [5] "E  predominates so remarkably that an individual sentence of any length is rarely seen, in which it is not the prevailing character. \"Here, then, we leave, in the very beginning, the groundwork for something more than a mere guess. The general use which may be made of the table is obvious--but, in this particular cipher, we shall only very partially require its aid. As our predominant character is 8, we will commence by assuming it as the  e  of the natural alphabet. To verify the supposition, let us observe if the 8 be seen often in couples--for  e  is doubled with great frequency in English--in such words, for example, as 'meet,' '.fleet,' 'speed,' 'seen,' been,' 'agree,' &c."
us %>%
  filter(author == "Twain") %>%
  sample_n(size = 5) %>%
  use_series("text")
## [1] "\"No'm. In Hookerville, seven mile below. I've walked all the way and I'm all tired out.\" \"Hungry, too, I reckon. I'll find you something.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "When we was three or four hundred yards down-stream we see the lantern show like a little spark at the texas door for a second, and we knowed by that that the rascals had missed their boat, and was beginning to understand that they was in just as much trouble now as Jim Turner was. Then Jim manned the oars, and we took out after our raft. Now was the first time that I begun to worry about the men--I reckon I hadn't had time to before. I begun to think how dreadful it was, even for murderers, to be in such a fix. I says to myself, there ain't no telling but I might come to be a murderer myself yet, and then how would I like it?"                                                                                                                                                                                                                                                                                           
## [3] "Napoleon and all his kind stood accounted for--and justified. When Rowena had at last done all her duty by the people in the parlor, she went upstairs to satisfy the longings of an overflow meeting there, for the parlor was not big enough to hold all the comers. Again she was besieged by eager questioners, and again she swam in sunset seas of glory. When the forenoon was nearly gone, she recognized with a pang that this most splendid episode of her life was almost over, that nothing could prolong it, that nothing quite its equal could ever fall to her fortune again. But never mind, it was sufficient unto itself, the grand occasion had moved on an ascending scale from the start, and was a noble and memorable success."                                                                                                                                                                                                
## [4] "(Bronze statue.) We look at it indifferently and the doctor asks: \"By Michael Angelo?\" \"No--not know who.\" Then he shows us the ancient Roman Forum. The doctor asks: \"Michael Angelo?\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [5] "Sinners that have been kept down and had examples held up to them, and suffered frequent lectures, and been so put upon in a moral way and in the matter of going slow and being serious and bottling up slang, and so crowded in regard to the matter of being proper and always and forever behaving, that their lives have become a burden to them, would not lag behind pilgrims at such a time as this, and wink furtively, and be joyful, and commit other such crimes--because it would not occur to them to do it. Otherwise they would. But they did do it, though--and it did them a world of good to hear the pilgrims abuse each other, too. We took an unworthy satisfaction in seeing them fall out, now and then, because it showed that they were only poor human people like us, after all. So we all rode down to Magdala, while the gnashing of teeth waxed and waned by turns, and harsh words troubled the holy calm of Galilee."

You can see that the documents are just short snippets from various novels.

Building a predictive model

Now, build a TF matrix using the parsed data and all of the available tokens. I recommend for now setting the minimum threshold for including a term to 0.001 and the maximum to 0.5. Print out the dimensions of the data:

X <- token %>%
  cnlp_utils_tf(doc_set = us$doc_id,
                min_df = 0.001,
                max_df = 0.5,
                max_features = 10000)

X_train <- X[us$train_id == "train", ]
y_train <- us$author[us$train_id == "train"]

dim(X)
## [1] 7500 6131

Now, fit an elastic net model using this data with alpha equal to 0.9 and three folds.

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, nfolds = 3, family = "multinomial"
)

And then compute the classification rate on the training and validation data:

us %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(author == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.999
## 2 valid         0.824

Next, produce a confusion matrix:

us %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  select(author, pred, train_id) %>%
  table()
## , , train_id = train
## 
##            pred
## author      Hawthorne  Poe Twain
##   Hawthorne      1899    0     1
##   Poe               0 1898     2
##   Twain             0    1  1899
## 
## , , train_id = valid
## 
##            pred
## author      Hawthorne  Poe Twain
##   Hawthorne       379  165    56
##   Poe              14  570    16
##   Twain            17   48   535

You should find that the model almost perfectly predicts the training data but performs much worse on the validation data. What is going on here? Let’s look at the terms that are the most powerful in the penalized regression model. Below, display the coefficients by picking a lambda that yields around 20 terms.
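
One way to find such a lambda (a sketch, not required by the lab) is to count, for each value in the fitted path, how many terms have a non-zero coefficient in at least one class:

# count the terms kept at each value of lambda in the cross-validated path
n_terms <- sapply(seq_along(model$lambda), function(i) {
  cf <- Reduce(cbind, coef(model, s = model$lambda[i]))
  sum(apply(cf != 0, 1, any)) - 1L   # subtract one to drop the intercept row
})

Here the 15th value of lambda turns out to keep roughly 20 terms, so we display the coefficients at that value: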

temp <- coef(model, s = model$lambda[15])   # per-class coefficients at the chosen lambda
beta <- Reduce(cbind, temp)                 # combine the three class vectors into one matrix
beta <- beta[apply(beta != 0, 1, any),]     # keep rows that are non-zero in at least one class
colnames(beta) <- names(temp)               # label the columns with the class names
beta
## 19 x 3 sparse Matrix of class "dgCMatrix"
##               Hawthorne           Poe       Twain
## (Intercept)  0.02012098 -0.0028985428 -0.01722244
## as           0.04357501  .             .         
## which        .           .            -0.21895946
## --          -0.17008533  0.0033418423  .         
## do           .           .             0.07900562
## 's           0.05631494 -0.0007051838  .         
## upon         .           0.0785367146  .         
## out          .           .             0.04099739
## go           .           .             0.05474182
## get          .           .             0.42748021
## :            .           .             0.27255463
## (            .           0.2103109589  .         
## )            .           0.0167664753  .         
## child        0.13021665  .             .         
## Tom          .           .             0.02822213
## Hepzibah     0.67474414  .             .         
## Phoebe       0.20198398  .             .         
## Pyncheon     0.36897412  .             .         
## Clifford     0.12212865  .             .

What types of terms show up here? Answer: I found two different types: the first is common function words and punctuation marks; the second is proper names of characters.

The difficulty with this task is that the data have not been split into training and validation sets at random. Rather, all of the snippets from a given novel were assigned entirely to the training set or entirely to the validation set. This means that a character’s name may appear in many fragments of a single novel on one side of the split (such as “Tom” from Mark Twain or “Clifford” from Nathaniel Hawthorne), but it will be of no use for the validation snippets, which come from other novels.
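
We can check this directly (a quick sketch, not part of the lab instructions) by counting how often a couple of these character names occur on each side of the split; names tied to a single novel should land almost entirely in one of the two sets.

# count occurrences of two character names by training/validation split
token %>%
  filter(token %in% c("Hepzibah", "Tom")) %>%
  inner_join(us, by = "doc_id") %>%
  group_by(token, train_id) %>%
  summarize(count = n())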

POS Filtering

In the code block below, create a TF matrix of just the lemmas that are adjectives (“ADJ”). This time do not cap the maximum proportion of documents that can contain a token; for this model (and the ones that follow), set max_df = 1. Fit an elastic net as above and print out the classification rate on the training and validation data.

X <- token %>%
  filter(upos == "ADJ") %>%
  cnlp_utils_tf(doc_set = us$doc_id,
                min_df = 0.001,
                max_df = 1,
                max_features = 10000)

X_train <- X[us$train_id == "train", ]
y_train <- us$author[us$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, nfolds = 3, family = "multinomial"
)

us %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(author == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.733
## 2 valid         0.651

Repeat with the “VERB” lemmas:

X <- token %>%
  filter(upos == "VERB") %>%
  cnlp_utils_tf(doc_set = us$doc_id,
                min_df = 0.001,
                max_df = 1,
                max_features = 10000)

X_train <- X[us$train_id == "train", ]
y_train <- us$author[us$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, nfolds = 3, family = "multinomial"
)

us %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(author == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.826
## 2 valid         0.708

And again, using the punctuation marks (“PUNCT”):

X <- token %>%
  filter(upos == "PUNCT") %>%
  cnlp_utils_tf(doc_set = us$doc_id,
                min_df = 0.001,
                max_df = 1,
                max_features = 10000)

X_train <- X[us$train_id == "train", ]
y_train <- us$author[us$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.9, nfolds = 3, family = "multinomial"
)

us %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(author == pred))
## # A tibble: 2 x 2
##   train_id class_rate
##   <chr>         <dbl>
## 1 train         0.618
## 2 valid         0.565

How would you compare the error rates of these models? Answer: Answers will vary.
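
Since the three models above differ only in the part of speech that is kept, one tidy way to re-run these comparisons is to wrap the fit-and-evaluate steps in a small helper. This is only a sketch; the function name upos_class_rate is mine and not part of the lab materials.

# fit the same elastic net on a single UPOS subset and return the
# classification rates on the training and validation data
upos_class_rate <- function(pos) {
  X <- token %>%
    filter(upos == pos) %>%
    cnlp_utils_tf(doc_set = us$doc_id,
                  min_df = 0.001,
                  max_df = 1,
                  max_features = 10000)

  model <- cv.glmnet(
    X[us$train_id == "train", ],
    us$author[us$train_id == "train"],
    alpha = 0.9, nfolds = 3, family = "multinomial"
  )

  us %>%
    mutate(pred = predict(model, newx = X, type = "class")) %>%
    group_by(train_id) %>%
    summarize(class_rate = mean(author == pred))
}

# for example: upos_class_rate("ADJ"); upos_class_rate("VERB"); upos_class_rate("PUNCT")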

Exploring the NLP data

For the remainder of the lab, we are going to use the data set to get familiar with the features that are available for building machine learning models. You may be inclined to rush, but try to take your time here.

Lemmas

Let’s start by seeing how the NLP engine has lemmatized the tokens. Filter the tokens to just those that have a lemma equal to “sit”. Group by the universal part of speech and token, count the number of occurrences, and arrange in descending order of the count. Take note of all of the tokens that get turned into the lemma “sit”.

token %>%
  filter(lemma == "sit") %>%
  group_by(upos, token) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 8 x 3
## # Groups:   upos [2]
##   upos  token   count
##   <chr> <chr>   <int>
## 1 VERB  sat       202
## 2 VERB  sit        79
## 3 VERB  sitting    78
## 4 VERB  sits       16
## 5 VERB  Sit         9
## 6 NOUN  sits        1
## 7 VERB  Sat         1
## 8 VERB  Sitting     1

Repeat for the verb “have”, which has many more forms. Notice the token “’ve” coming from contractions such as “you’ve”, and “’d” from “you’d”, “he’d”, or “she’d”.

token %>%
  filter(lemma == "have") %>%
  group_by(upos, token) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 13 x 3
## # Groups:   upos [2]
##    upos  token  count
##    <chr> <chr>  <int>
##  1 AUX   had     5828
##  2 AUX   have    3499
##  3 AUX   has      979
##  4 VERB  having   331
##  5 AUX   've      138
##  6 VERB  had      138
##  7 VERB  Having    91
##  8 AUX   Had       74
##  9 AUX   'd        54
## 10 AUX   Have      47
## 11 AUX   Has       20
## 12 VERB  HAVE       1
## 13 VERB  havin'     1

Let’s see what happens to a noun. Repeat the process for the term “bird”:

token %>%
  filter(lemma == "bird") %>%
  group_by(upos, token) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 2 x 3
## # Groups:   upos [1]
##   upos  token count
##   <chr> <chr> <int>
## 1 NOUN  bird     76
## 2 NOUN  birds    56

Notice that it includes the singular and plural form of the word.

Finally, do the same with the lemma “-PRON-”.

token %>%
  filter(lemma == "-PRON-") %>%
  group_by(upos, token) %>%
  summarize(sm_count()) %>%
  arrange(desc(count))
## # A tibble: 60 x 3
## # Groups:   upos [2]
##    upos  token count
##    <chr> <chr> <int>
##  1 PRON  I     10943
##  2 PRON  it     8870
##  3 DET   his    7458
##  4 PRON  he     6405
##  5 DET   my     3866
##  6 PRON  you    3517
##  7 PRON  him    3407
##  8 PRON  me     2790
##  9 PRON  we     2691
## 10 DET   her    2665
## # … with 50 more rows

You should notice that spaCy has turned all pronouns into a generic lemma “-PRON-”. This is often not a good idea for predictive models and is something that we will sometimes want to adjust before creating a model.
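
One possible adjustment (a sketch only; the lab does not require it) is to fall back to the lower-cased token whenever the lemma is “-PRON-”, so that the individual pronouns remain distinguishable when building a term frequency matrix.

# replace the generic "-PRON-" lemma with the lower-cased surface token
token_fixed <- token %>%
  mutate(lemma = if_else(lemma == "-PRON-", tolower(token), lemma))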

XPOS Tags

In the notes for today, we saw the top lemmas associated with each UPOS tag. Repeat the process here with the xpos tags and the tokens (rather than lemmas). It is probably useful to group by both the upos and xpos tags (group_by(upos, xpos, token)).

token %>%
  group_by(upos, xpos, token) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 8) %>%
  summarize(sm_paste(token)) %>%
  rmarkdown::paged_table()        # just for the notes to make the data viewable

Go through the list and try to understand what each of the tags captures. Use the spaCy reference or ask me to explain any that you are having trouble with.