library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(cleanNLP)
library(glmnet)
library(magrittr)
library(stringi)

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
options(sparse.colnames = TRUE)

Amazon Products

As a good example dataset, let’s again use the Amazon product classification data:

set.seed(1)

amazon <- read_csv("data/amazon_product_class.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
token <- read_csv("data/amazon_product_class_token.csv.gz")

Today we are going to discuss additional ways to construct your model matrix, building on the techniques in the previous notebook. In the Notebook 8 pipeline, these will replace or augment Step 2. Note that many of these are useful only in certain applications. It is generally best to start with raw frequencies and move to other options based on the specific goals you have and the data you are working with.

Adding Pronouns

The spaCy NLP pipeline converts all pronouns into a single lemma: “-PRON-”. I have always found this strange, and it is often not great for the types of analysis we are doing with the lemmas. To fix this, we can replace every pronoun lemma with its lowercased token. Looking at the first few columns of the data matrix, we see that some of the pronouns are quite common:

X <- token %>%
  mutate(lemma = if_else(upos == "PRON", stri_trans_tolower(token), lemma)) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

X[1:10, 1:15]
## 10 x 15 sparse Matrix of class "dgCMatrix"
##          i it you good make they so can will book well very just more great
## doc00001 4  3   .    .    .    .  .   .    .    .    .    .    .    1     .
## doc00002 4  4   .    2    .    1  1   .    .    2    .    2    .    3     3
## doc00003 .  1   2    1    1    1  .   .    1    3    .    1    1    1     1
## doc00004 2  1   .    1    .    .  .   1    .    .    2    .    .    .     .
## doc00005 2  1   .    1    .    .  .   .    .    1    .    .    .    .     .
## doc00006 2  .   .    .    .    .  .   .    .    .    .    .    .    .     .
## doc00007 3  3   1    .    .    .  .   .    .    1    .    .    .    .     1
## doc00008 .  4   .    .    1    1  2   .    .    .    3    .    1    1     .
## doc00009 1  2   .    3    .    .  .   .    .    2    .    .    .    .     1
## doc00010 6  .   .    2    .    5  .   1    1    .    1    2    .    1     .

Using this in your model may improve the predictive power and show different writing strategies used in each category.
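
For a sense of how this plugs into the rest of the pipeline, here is a minimal sketch of the fit-and-evaluate steps from the previous notebook applied to this matrix; the elastic net settings (alpha = 0.9, the default lambda.1se cut-off in predict) are illustrative assumptions rather than fixed choices:

# a sketch of the remaining pipeline steps: fit a penalized multinomial
# regression on the training rows, then measure the classification rate
# on the training and validation sets
model <- cv.glmnet(
  X[amazon$train_id == "train", ],
  amazon$category[amazon$train_id == "train"],
  family = "multinomial",
  alpha = 0.9
)

amazon %>%
  mutate(pred = as.vector(predict(model, newx = X, type = "class"))) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))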

Sentence Number

Another technique is to restrict the sentences used in creating the term frequency matrix. This code, for example, will look at just the first sentence in each document.

X <- token %>%
  filter(sid == 1) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

I find that looking at just the first one or two sentences can be useful when working with reviews. Often the first sentence gives the clearest signal of someone’s sentiment towards a product, which is then clarified and refined later in the body of the review.
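
For example, keeping the first two sentences is just a small change to the filter on sid:

# same pipeline as above, but keep the first two sentences of each document
X <- token %>%
  filter(sid <= 2) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )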

With a slightly different approach, we can also look at the last sentence in each document:

X <- token %>%
  group_by(doc_id) %>%
  filter(sid == max(sid)) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

This is often useful for looking at writing style, as particular people tend to end their reviews with similar language.

POS as a Feature

The code we have so far can be easily changed to look at the frequency of linguistic features other than lemmas. For example, we could look at how often each part of speech is used in the reviews by changing the token_var argument to the cnlp_utils_tf function.

X <- token %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "upos",
    
  )

X[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##          NOUN PUNCT DET VERB ADJ ADP PRON AUX ADV CCONJ
## doc00001   11     8   2    8   1   7    7   3   2     3
## doc00002   67    62  44   36  37  37   18  18  16    12
## doc00003   24    14  13   15  10  18   10   7  14     2
## doc00004   36    22  31   23  17  24    8  11  13     7
## doc00005   26    10  20   14  11  13    4   5   5     2
## doc00006    7     3   3    3   2   2    2   2   2     2
## doc00007   18     8  10   11   4   8    9   7   2     1
## doc00008   85    94  55   37  37  44   17  27  32    15
## doc00009   12     9   8    9   7   4    5   7   .     4
## doc00010   22    21  14   18  23  10   13  14  10     7

With part of speech codes, it is often also useful to look at relatively large N-grams. As there are far fewer distinct part of speech codes than lemmas, it is not unreasonable to look at 4- or even 5-grams.

X <- token %>%
  sm_ngram(n = 4, n_min = 1, doc_var = "doc_id", token_var = "upos") %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "token"
  )

dim(X)
## [1]  8823 10000
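
If you want a sense of what these features look like before modeling with them, you can peek at a few of the column names; each one is a sequence of between one and four part of speech codes produced by sm_ngram:

# inspect a handful of the part of speech n-gram features
colnames(X)[1:8]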

Looking at POS N-grams is usually most useful for isolating writing style, and is a good way of distinguishing one author from another.

Tagging Lemmas with POS

Often a single lemma can serve as different parts of speech, such as the noun “love” and the verb “love”. We can treat these as distinct features by combining the lemma and the part of speech into a single variable and using that combined tag as the unit we count in the data matrix.

X <- token %>%
  mutate(lemma = if_else(upos == "PRON", stri_trans_tolower(token), lemma)) %>%
  mutate(lemma = sprintf("%s_%s", lemma, upos)) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

X[1:10, 1:8]
## 10 x 8 sparse Matrix of class "dgCMatrix"
##          i_PRON it_PRON you_PRON good_ADJ make_VERB they_PRON so_ADV
## doc00001      4       3        .        .         .         .      .
## doc00002      4       4        .        2         .         1      1
## doc00003      .       1        2        1         1         1      .
## doc00004      2       1        .        1         .         .      .
## doc00005      2       1        .        1         .         .      .
## doc00006      2       .        .        .         .         .      .
## doc00007      3       3        1        .         .         .      .
## doc00008      .       4        .        .         1         1      2
## doc00009      1       2        .        3         .         .      .
## doc00010      6       .        .        2         .         5      .
##          will_VERB
## doc00001         .
## doc00002         .
## doc00003         1
## doc00004         .
## doc00005         .
## doc00006         .
## doc00007         .
## doc00008         .
## doc00009         .
## doc00010         1

I rarely find that this makes a significant difference in the predictive power of the model, but it does make the coefficients easier to interpret, because we know which sense of each lemma is being used. For example, above we know that “will” is the modal verb (“I will be late”) and not the noun (“He had the will to carry on”).
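
As a rough sketch of what that looks like in practice (again using an illustrative elastic net fit with alpha = 0.9), the tagged lemmas appear directly in the coefficient names:

# fit the penalized model on the lemma_POS features
model <- cv.glmnet(
  X[amazon$train_id == "train", ],
  amazon$category[amazon$train_id == "train"],
  family = "multinomial",
  alpha = 0.9
)

# coef() returns one sparse coefficient vector per category; keeping only
# the non-zero rows shows which tagged lemmas are associated with each class
beta <- coef(model, s = "lambda.1se")
lapply(beta, function(b) rownames(b)[which(b[, 1] != 0)])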

Scaling the Counts

Another way to modify the features in your data is to change how the term frequencies are scaled. By default, we use raw counts, but we can change the option tf_weight to “binary” to convert every count into a 0/1 indicator variable:

X <- token %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB")) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma",
    tf_weight = "binary"
  )

X[1:10, 1:15]
## 10 x 15 sparse Matrix of class "dgCMatrix"
##          good make so can will book well very just more great when time love
## doc00001    .    .  .   .    .    .    .    .    .    1     .    .    .    .
## doc00002    1    .  1   .    .    1    .    1    .    1     1    1    1    .
## doc00003    1    1  .   .    1    1    .    1    1    1     1    .    1    .
## doc00004    1    .  .   1    .    .    1    .    .    .     .    .    1    .
## doc00005    1    .  .   .    .    1    .    .    .    .     .    1    .    .
## doc00006    .    .  .   .    .    .    .    .    .    .     .    .    1    .
## doc00007    .    .  .   .    .    1    .    .    .    .     1    .    .    .
## doc00008    .    1  1   .    .    .    1    .    1    1     .    .    .    .
## doc00009    1    .  .   .    .    1    .    .    .    .     1    .    .    .
## doc00010    1    .  .   1    1    .    1    1    .    1     .    .    .    .
##          other
## doc00001     .
## doc00002     1
## doc00003     .
## doc00004     .
## doc00005     .
## doc00006     .
## doc00007     .
## doc00008     .
## doc00009     .
## doc00010     .

This can be useful for comparing texts of different lengths and for keeping the model from focusing on very common words (without the need for a hard cut-off).

Manual Features from Tokens

Another way to expand the available features for analysis is to construct a set of individual features from the token table. For example, we can compute the type-token ratio (the number of unique lemmas divided by the total number of tokens), the number of sentences, and the number of words in each document. I will also compute the frequency of the lemma “that” just to show how these features can be compared to the TF matrix.

token_summary <- token %>% 
  group_by(doc_id) %>%
  filter(!is.na(lemma)) %>%
  summarize(
    type_token_ratio = length(unique(lemma)) / n(),
    num_sentences = max(sid),
    num_tokens = n(),
    num_word_that = sum(lemma == "that")
  ) 

One thing that is nice about this data is that we can visualize it on its own, without needing a predictive model:

amazon %>%
  inner_join(token_summary, by = "doc_id") %>%
  group_by(category) %>%
  summarize(sm_mean_ci_normal(type_token_ratio)) %>%
  arrange(desc(type_token_ratio_mean)) %>%
  ggplot() +
    geom_pointrange(aes(
      x = category,
      y = type_token_ratio_mean,
      ymin = type_token_ratio_ci_min,
      ymax = type_token_ratio_ci_max
    ))

Of course, we can also put this into a model matrix:

X_cov <- amazon %>%
  select(doc_id, category) %>%
  left_join(token_summary, by = "doc_id") %>%
  model.frame(category ~ type_token_ratio + num_sentences + num_tokens -1, data = .) %>%
  model.matrix(attr(., "terms"), .)

head(X_cov)
##   type_token_ratio num_sentences num_tokens
## 1        0.7500000             6         80
## 2        0.4888337            25        403
## 3        0.6122449             5        147
## 4        0.5565611             8        221
## 5        0.7131148             7        122
## 6        0.8214286             2         28

These covariates can then be used alone or alongside other predictors, as in the sketch below.
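
As a minimal sketch of the alongside case: both X (the term frequency matrix built above) and X_cov have one row per document in the order of amazon$doc_id, so, assuming no documents were dropped along the way, they can be column-bound before fitting (alpha = 0.9 is again an illustrative choice):

# column-bind the hand-built covariates onto the term frequency matrix;
# both have one row per document, in the order of amazon$doc_id
X_all <- cbind(X, X_cov)

model <- cv.glmnet(
  X_all[amazon$train_id == "train", ],
  amazon$category[amazon$train_id == "train"],
  family = "multinomial",
  alpha = 0.9
)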

Word Frequency

The final new feature type we will see today uses a new dataset that is included anytime you load the cleanNLP package. It is called word_frequency, and it gives the frequency of different words over a large corpus of website texts:

word_frequency
## # A tibble: 150,000 x 3
##    language word  frequency
##    <chr>    <chr>     <dbl>
##  1 en       the       3.93 
##  2 en       of        2.24 
##  3 en       and       2.21 
##  4 en       to        2.06 
##  5 en       a         1.54 
##  6 en       in        1.44 
##  7 en       for       1.01 
##  8 en       is        0.800
##  9 en       on        0.638
## 10 en       that      0.578
## # … with 149,990 more rows

We can use this to remove very common words from our TF matrix as follows:

X <- token %>%
  mutate(lemma = if_else(upos == "PRON", stri_trans_tolower(token), lemma)) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  semi_join(
    filter(word_frequency, frequency < 0.001), by = c("lemma" = "word")
  ) %>%
  cnlp_utils_tf(
    doc_set = amazon$doc_id,
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "doc_id",
    token_var = "lemma"
  )

X[1:10, 1:11]
## 10 x 11 sparse Matrix of class "dgCMatrix"
##          snack tasty enjoyable ingredient remind fascinating marry portray
## doc00001     .     .         .          .      .           .     .       .
## doc00002     .     .         .          .      .           .     .       .
## doc00003     .     .         .          .      .           .     .       .
## doc00004     .     .         .          .      .           .     .       .
## doc00005     .     .         .          .      .           .     .       .
## doc00006     .     .         .          .      .           .     .       .
## doc00007     .     .         .          .      .           .     .       .
## doc00008     .     .         .          .      .           .     .       .
## doc00009     .     .         .          .      .           .     .       .
## doc00010     .     .         .          .      .           .     .       .
##          calorie disappoint gluten
## doc00001       .          .      .
## doc00002       .          .      .
## doc00003       .          .      .
## doc00004       .          .      .
## doc00005       .          .      .
## doc00006       .          .      .
## doc00007       .          .      .
## doc00008       .          .      .
## doc00009       .          .      .
## doc00010       .          .      .

Another, somewhat more interesting, use of word frequencies is as a meta-feature for understanding word usage within different documents. To start, let’s break the word frequencies into 10 buckets according to their overall frequency (so that, in theory, the words in each bucket account for roughly the same share of running text):

ngroups <- 10
freq_b <- word_frequency %>%
  mutate(cum_freq = cumsum(frequency)) %>%
  mutate(bucket = cut(
    cum_freq,
    breaks = seq(0, 100, length.out = ngroups + 1),
    labels = FALSE,
    include.lowest = TRUE
  ))
freq_b
## # A tibble: 150,000 x 5
##    language word  frequency cum_freq bucket
##    <chr>    <chr>     <dbl>    <dbl>  <int>
##  1 en       the       3.93      3.93      1
##  2 en       of        2.24      6.17      1
##  3 en       and       2.21      8.38      1
##  4 en       to        2.06     10.4       2
##  5 en       a         1.54     12.0       2
##  6 en       in        1.44     13.4       2
##  7 en       for       1.01     14.4       2
##  8 en       is        0.800    15.2       2
##  9 en       on        0.638    15.9       2
## 10 en       that      0.578    16.5       2
## # … with 149,990 more rows

Then, we can assign each lemma to one of the buckets and count how often each bucket is used in each document:

token_summary <- token %>% 
  inner_join(freq_b, by = c("lemma" = "word")) %>%
  group_by(doc_id, bucket) %>%
  summarize(sm_count()) %>%
  mutate(count = count / sum(count) * 100) %>%
  ungroup() %>%
  pivot_wider(
    names_from = "bucket",
    names_prefix = "bucket_",
    values_from = "count",
    names_sort = TRUE,
    values_fill = 0
  )

token_summary
## # A tibble: 8,823 x 11
##    doc_id bucket_1 bucket_2 bucket_3 bucket_4 bucket_5 bucket_6 bucket_7
##    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 doc00…    11.8     11.8      20.6    20.6      5.88     5.88    11.8 
##  2 doc00…    12.5     13.9      17.5    11.1      5.71    10.4      7.5 
##  3 doc00…     7.76    19.0      18.1    12.9      6.03    13.8      6.90
##  4 doc00…    13.4     15.1      16.3     6.40     4.07    11.0     10.5 
##  5 doc00…    12       14.       12      12        7.      12       13   
##  6 doc00…    13.0     13.0      13.0     8.70     8.70    26.1      4.35
##  7 doc00…     7.58    13.6      21.2    10.6     12.1      3.03    12.1 
##  8 doc00…    13.2     12.1      18.2     9.12     8.82    10.6      8.82
##  9 doc00…    13.7     15.7      13.7    19.6      1.96    13.7      5.88
## 10 doc00…     9.92     9.09     20.7    14.9      6.61     6.61     6.61
## # … with 8,813 more rows, and 3 more variables: bucket_8 <dbl>,
## #   bucket_9 <dbl>, bucket_10 <dbl>

And then this can be used as a set of features:

X_cov <- amazon %>%
  select(doc_id, category) %>%
  left_join(token_summary, by = "doc_id") %>%
  select(-doc_id) %>%
  model.frame(category ~ . -1, data = .) %>%
  model.matrix(attr(., "terms"), .)

X_cov[1:10, 1:6]
##     bucket_1  bucket_2 bucket_3  bucket_4  bucket_5  bucket_6
## 1  11.764706 11.764706 20.58824 20.588235  5.882353  5.882353
## 2  12.500000 13.928571 17.50000 11.071429  5.714286 10.357143
## 3   7.758621 18.965517 18.10345 12.931034  6.034483 13.793103
## 4  13.372093 15.116279 16.27907  6.395349  4.069767 11.046512
## 5  12.000000 14.000000 12.00000 12.000000  7.000000 12.000000
## 6  13.043478 13.043478 13.04348  8.695652  8.695652 26.086957
## 7   7.575758 13.636364 21.21212 10.606061 12.121212  3.030303
## 8  13.235294 12.058824 18.23529  9.117647  8.823529 10.588235
## 9  13.725490 15.686275 13.72549 19.607843  1.960784 13.725490
## 10  9.917355  9.090909 20.66116 14.876033  6.611570  6.611570

You can change the number of buckets, or collapse some of the most frequent ones. These are often useful features to use when looking at writing style.
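
For example, here is a small sketch of collapsing the three most frequent buckets into a single feature; the new column name is just illustrative:

# merge the three most common buckets into one "high frequency" share
token_summary_collapsed <- token_summary %>%
  mutate(bucket_high = bucket_1 + bucket_2 + bucket_3) %>%
  select(-bucket_1, -bucket_2, -bucket_3)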