For today’s notes we are going to look again at the Amazon product classification task. We will read in the docs and anno tables:
docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")
As a reminder from last time, the goal is to predict which product category a review comes from based on the text of the user's review. Today, we will focus on the ways that we can influence the construction of the elastic net model.
By far the most important way that we can modify the model is by selecting the features that will be used in the model. By default, the model uses the lemma column to build features. It creates features from the 10,000 most frequent terms that occur in at least 0.1% of the documents. Here is the function to build the elastic net model with all of those defaults made explicit:
model <- dsst_enet_build(anno, docs,
                         min_df = 0.001,
                         max_df = 1.0,
                         max_features = 10000)
We can change the model by modifying these values. For example, setting max_df to 0.5 will exclude lemmas that appear in more than half of the reviews. This can be useful if the model is including too many stylistic words, such as pronouns, that we do not want as part of our analysis. There is no clear rule about how to set these values; experimentation is key, and you will practice this in the class notes for today.
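As a minimal sketch of that adjustment, keeping the other arguments at their defaults:

# drop any lemma that appears in more than half of the reviews
model <- dsst_enet_build(anno, docs, max_df = 0.5)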
Another way to change the predictive model that is being created is through filtering the annotations table before passing it to the dsst_enet_build function. This requires that we write a little bit of our own R code. One of the most common ways we will want to filter the data is through the column upos, the Universal POS tags.
We can use the upos tags to select only certain parts of speech. This can be useful in a few ways, such as removing structural tokens (i.e., punctuation and determiners), or focusing on stylistic features such as adjectives in order to avoid content-specific terms. Here is a typical usage, where we include only the nouns, verbs, adjectives, adverbs, and pronouns:
model <- anno %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  dsst_enet_build(docs)
Often you will find that you want to build multiple models with different parts of speech to understand a dataset.
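For example, we might compare a model built only from nouns with one built from more stylistic parts of speech. Here is a sketch of that comparison; the object names are just illustrative:

# a content-focused model using only nouns
model_content <- anno %>%
  filter(upos == "NOUN") %>%
  dsst_enet_build(docs)

# a style-focused model using adjectives, adverbs, and pronouns
model_style <- anno %>%
  filter(upos %in% c("ADJ", "ADV", "PRON")) %>%
  dsst_enet_build(docs)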
Another way to modify the model is to transform the lemmas found in the anno table. A popular way to do this is through the use of features called N-grams. These look at combinations of adjacent tokens rather than just individual tokens. The N refers to the number of tokens that are put together.

The helper function dsst_ngram will produce N-gram features of various lengths. We can specify the maximum (n) and minimum (n_min) length of the tokens. The output of this can then be passed directly to dsst_enet_build. For example:
model <- anno %>%
  dsst_ngram(n = 2, n_min = 2) %>%
  dsst_enet_build(docs)
You can see the impact of the N-grams in the model coefficients:
dsst_coef(model$model, lambda_num = 30)
## 13 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.15538849 0.006374181 0.149014308 .
## this book 0.79035526 . . 2
## this movie . 0.505010237 . 13
## the book 0.17900509 . . 18
## this film . 0.207687586 . 20
## book . 0.19341191 . . 20
## film be . 0.114304618 . 22
## the film . 0.078635352 . 23
## movie . . 0.172070238 . 24
## book be 0.08036933 . . 25
## to read 0.09108604 . . 28
## film . . 0.041923087 . 28
## of the . . -0.009114281 29
An extension of this approach is called skip-grams. These work similarly but allow combinations of N words that skip over up to K intervening terms. To implement these, we use the function dsst_skip_gram:
model <- anno %>%
  dsst_skip_gram(n = 2, n_min = 2, k = 1) %>%
  dsst_enet_build(docs)
It can be fun to experiment with different settings for these values, but keep in mind that setting the parameters too high will result in a very large dataset that may take a long time to finish running.
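As one illustration of a larger setting (sketch only; this feature set grows quickly and can be slow to build):

# unigrams through trigrams, allowing up to two skipped terms
model <- anno %>%
  dsst_skip_gram(n = 3, n_min = 1, k = 2) %>%
  dsst_enet_build(docs)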
We can also use the upos values to change the annotations, rather than just filtering them. First of all, you will have noticed that the annotations create a lemmatised version of pronouns that maps every pronoun to the string “-PRON-”. You may want to put these back into the lemmas, which we can do with the following code:
model <- anno %>%
  mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
  mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
  dsst_enet_build(docs)
We can also add the part of speech tag to the end of the lemma. This rarely changes the model much but helps interpret the coefficients:
model <- anno %>%
  mutate(lemma = paste0(lemma, "_", upos)) %>%
  dsst_enet_build(docs)
You can see the effect of this change here:
dsst_coef(model$model, lambda_num = 50)
## 16 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.2160591 0.01776748 0.198291615 .
## book_NOUN 0.6925786 . . 2
## movie_NOUN . 0.56825903 . 10
## film_NOUN . 0.24714900 . 18
## flavor_NOUN . . 0.589105078 19
## read_VERB 0.4231783 . . 20
## taste_VERB . . 0.599092119 22
## taste_NOUN . . 0.596607517 22
## product_NOUN . . 0.299008650 32
## watch_VERB . 0.16460704 . 37
## dvd_NOUN . 0.20163264 . 39
## eat_VERB . . 0.133483895 41
## story_NOUN . . -0.067088477 43
## read_NOUN 0.1294661 . . 46
## the_DET . . -0.001927111 48
## delicious_ADJ . . 0.011247454 50
Some more advice about how to use these techniques is included below, but as always some trial and error is needed.
As a final tweak today, we can also use a completely different variable in the anno table to build the features by setting the token_var option. For example, we could use the upos tags themselves:
model <- anno %>%
  dsst_enet_build(docs, token_var = "upos")
This results in a very different type of model:
dsst_coef(model$model)
## 15 x 4 sparse Matrix of class "dgCMatrix"
## book film food MLN
## (Intercept) -0.248980100 -0.222040273 0.471020373 .
## DET . 0.010528716 -0.037635805 2
## ADP 0.020288105 . -0.006273853 11
## PROPN . 0.030423746 -0.024555573 14
## PART 0.037567756 . . 31
## VERB 0.008172884 . . 40
## INTJ -0.150151789 0.023595291 . 55
## SYM . -0.007301116 0.119941005 57
## X . -0.169178788 0.102577676 61
## NUM -0.024592991 . 0.097530599 62
## PRON . -0.012662239 0.023958537 63
## SPACE . 0.005668980 . 74
## AUX -0.008352086 . . 87
## ADV -0.003825879 . . 90
## ADJ . . 0.005461118 93
We will see models like this more as we turn to authorship prediction tasks in the next few projects.
As a final tool for today, we can also go back and identify the use of particular terms within the documents using a method called keywords in context, or KWiC. The function works by passing it the anno table along with a term to search for (and, optionally, the number of items to return). Here, for example, we can see the ways that the term “read” is used in the texts:
dsst_kwic(anno, terms = "read")
## [doc00076] riting style kept me |reading| , and the simple ide
## [doc00793] ast paced & visceral |read | .
## [doc00988] Having |read | that Tom Cruise was
## [doc01427] its really not a bad |read | , just disappointing
## [doc01904] Having |read | the first three book
## [doc02138] er and would like to |read | more of his works .
## [doc02340] Thank you for |reading| this
## [doc02782] business book I 've |read | sinceThe Culture Cod
## [doc02949] I always enjoy |reading| anything of Agatha C
## [doc03237] eries before i began |reading|
## [doc03535] th information after |reading| the first section ,
## [doc03742] guess I may as well |read | the damned thing , n
## [doc04005] me that I needed to |read | Aidan Donnelley Rowl
## [doc04551] You must |read | the whole set of boo
## [doc04818] It 's easy to |read | & filled to the brim
## [doc05262] I have |read | the book and recomme
## [doc05409] Excellent |read | ! !
## [doc05897] ter that I only kept |reading| in hopes that the he
## [doc06473] was easy and fun to |read | with her and for me
## [doc06903] book , while easy to |read | , just does n't alwa
This will be useful when we start working with datasets where it is not immediately clear why a particular term is being associated with a label type.
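For example, to understand why a term from the coefficient table above, such as “taste”, is associated with the food category, we could search for it in the same way (output not shown):

# look at occurrences of "taste" in their surrounding context
dsst_kwic(anno, terms = "taste")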
There is no one best way to apply these techniques; you will generally need to experiment and often use multiple models in an analysis. Note that the most predictive model is not always the best one for your analysis. We are not just (or even primarily) focused on making predictions.
I can offer a bit of general advice, though, as you work through the next set of notes and the first project. I find that filtering on the five most common parts of speech and fixing the pronoun issue makes a good first pass at a model. I sometimes also tag lemmas with their part of speech code for interpretability, but otherwise leave things at the defaults. This is the model I most often use for negative examples, maximum probability examples, and coefficient tables. Then, depending on the application, I may try other techniques such as N-grams with n = 2 or n = 3 and n_min = 1. If there is a noticeable improvement I will try to understand why; otherwise I do not bother with it much.
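Putting that advice together, a first-pass model along those lines, combining the part-of-speech filter and the pronoun fix from above, would look something like this:

# first pass: keep the five most common parts of speech and
# restore readable pronoun lemmas before building the elastic net
model <- anno %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
  mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
  dsst_enet_build(docs)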
Using alternative features is useful for other tasks, but not so much for the ones we will see in the next few weeks. We will return to these with Project 2.