library(tidyverse)
library(forcats)
library(ggrepel)
library(smodels)
library(cleanNLP)
library(glmnet)
theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
In these notes we will look at a data set of product reviews from Amazon. Reviews come from one of three different categories (Books, Films, and Food); the task we will investigate is classifying each item into one of these three categories.
set.seed(1)
amazon <- read_csv("data/amazon_product_class.csv") %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))
amazon
## # A tibble: 8,823 x 4
## doc_id category text train_id
## <chr> <chr> <chr> <chr>
## 1 doc00001 food "At some point I would like to try TeffBob's R… valid
## 2 doc00002 book "John Crowley wrote \"Little, Big\", in the w… valid
## 3 doc00003 book "The New York Post is often credited with popu… train
## 4 doc00004 film "INTO THIN AIR, based on Jon Krakauer's best s… train
## 5 doc00005 film "When the Wind Blows was based upon an English… train
## 6 doc00006 food "I have sent this basket a number of times to … valid
## 7 doc00007 book "If you enjoy history, this book is an selecti… train
## 8 doc00008 film "Though it holds up surprisingly well over thi… valid
## 9 doc00009 book "Agatha's written is amazing. The whole story … train
## 10 doc00010 food "Pecans are my favorite nut, alas, they are fa… valid
## # … with 8,813 more rows
We will use a more sophisticated method for producing term frequency (TF) matrices and see how to adapt our text classification methods to the case where there are more than two categories.
As we did in the previous notes, we will use the cleanNLP package to split the textual data into a format with one row for each token (word or punctuation mark). Last time we used the stringi backend, which is fast but error-prone. This time we will use a library called spacy for extracting linguistic features from the text. Running this backend can be done with the following code:
cnlp_init_spacy("en_core_web_sm")
token <- cnlp_annotate(amazon)$token
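Running this backend requires having the Python library spacy, along with the Python module that cleanNLP calls into, installed and configured on your machine. The sketch below shows one way to do the setup from R; it assumes you manage Python through reticulate, and the module and model names (cleannlp, en_core_web_sm) are taken from the cleanNLP documentation, so treat this as a rough guide rather than a definitive recipe:
# One-time setup sketch (assumptions noted above)
reticulate::py_install("cleannlp", pip = TRUE)    # Python module used by the cleanNLP backends
system("python -m spacy download en_core_web_sm") # download the small English spacy model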
Even with spacy set up, annotating the full set of reviews can take a few minutes. As an alternative, we will just load in the pre-computed data here (and I will provide similar parsed data for the lab and projects):
token <- read_csv("data/amazon_product_class_token.csv.gz")
token
## # A tibble: 1,548,904 x 10
## doc_id sid tid token token_with_ws lemma upos xpos tid_source
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 doc00… 1 1 At At at ADP IN 6
## 2 doc00… 1 2 some some some DET DT 3
## 3 doc00… 1 3 point point point NOUN NN 1
## 4 doc00… 1 4 I I -PRO… PRON PRP 6
## 5 doc00… 1 5 would would would VERB MD 6
## 6 doc00… 1 6 like like like VERB VB 0
## 7 doc00… 1 7 to to to PART TO 8
## 8 doc00… 1 8 try try try VERB VB 6
## 9 doc00… 1 9 Teff… TeffBob Teff… PROPN NNP 15
## 10 doc00… 1 10 's 's 's PART POS 9
## # … with 1,548,894 more rows, and 1 more variable: relation <chr>
There is a lot of information that has been automatically added to this table, the collective results of decades of research in computational linguistics and natural language processing. Each row corresponds to a word or a punctuation mark, along with metadata describing the token. Notice that reading down the token column reproduces the original text. The columns available are: doc_id, the document the token came from; sid, the sentence number within that document; tid, the token number within the sentence; token, the token itself; token_with_ws, the token along with any trailing whitespace; lemma, the base form of the word; upos, the universal part of speech code; xpos, the language-specific (Penn Treebank) part of speech code; tid_source, the tid of the token that this token is grammatically related to; and relation, the name of that grammatical relationship.
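As a quick check of that claim, here is a small sketch that pastes the token_with_ws column (each token together with its trailing whitespace) back together for the first review; the result should closely match the original text stored in the amazon table:
token %>%
  filter(doc_id == "doc00001") %>%                          # just the first review
  summarize(text = paste(token_with_ws, collapse = "")) %>% # glue the tokens back together
  pull(text)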
There are many analyses that can be performed on the extracted features in this table. We will look at a few here and expand on them over the next few weeks.
Before delving into the new variables in the tokens table, let’s start by replicating the analysis we did last time with the spam data, this time predicting the Amazon product category. Because this is a larger data set, we will pass some additional parameters to the cnlp_utils_tf function to include only those terms that appear in at least 0.1% of the documents and in no more than 50% of the documents. We can also set the maximum number of terms that will be included.
X <- token %>%
  cnlp_utils_tf(doc_set = amazon$doc_id,
                min_df = 0.001,
                max_df = 0.5,
                max_features = 10000)
dim(X)
## [1] 8823 6871
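To get a feel for what ended up in the matrix, we can peek at the retained vocabulary (the column names of X) and at the terms with the largest total counts. This is just an exploratory sketch using the X object built above:
head(colnames(X), 20)                         # a few of the retained terms
head(sort(colSums(X), decreasing = TRUE), 10) # terms with the largest total counts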
If there are more than max_features terms that fall between the allowed frequencies, the most frequent of these will be used. As before, we will create the training data and run the glmnet function.
X_train <- X[amazon$train_id == "train", ]
y_train <- amazon$category[amazon$train_id == "train"]

model <- cv.glmnet(
  X_train, y_train, alpha = 0.2, family = "multinomial", nfolds = 3
)
Like the spam model, our penalized regression does a very good job of classifying products. The training set is nearly 99% accurate and the validation set is 96.5% accurate.
amazon %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  group_by(train_id) %>%
  summarize(class_rate = mean(category == pred))
## # A tibble: 2 x 2
## train_id class_rate
## <chr> <dbl>
## 1 train 0.989
## 2 valid 0.965
Looking at a confusion matrix, we see that the few mistakes that do occur happen when books and films are confused with one another. Can you think of why this might happen?
amazon %>%
  mutate(pred = predict(model, newx = X, type = "class")) %>%
  select(category, pred, train_id) %>%
  table()
## , , train_id = train
##
## pred
## category book film food
## book 1745 19 13
## film 14 1762 11
## food 0 3 1751
##
## , , train_id = valid
##
## pred
## category book film food
## book 1114 40 21
## film 44 1091 13
## food 1 3 1178
Unlike the two-class spam model, the multinomial model fits a separate set of coefficients for each category. Below we extract the coefficients at one value of the penalty, keep only the terms with at least one non-zero coefficient, and label the columns by category.
temp <- coef(model, s = model$lambda[22])
beta <- Reduce(cbind, temp)
beta <- beta[apply(beta != 0, 1, any),]
colnames(beta) <- names(temp)
beta
## 12 x 3 sparse Matrix of class "dgCMatrix"
## book film food
## (Intercept) -0.055797141 0.02218407 0.033613070
## book 0.135159825 . -0.001795063
## read 0.175900860 . .
## story . . -0.001530181
## movie . 0.11408253 .
## film . 0.05474195 .
## taste . . 0.226705692
## watch . 0.10448992 .
## write 0.005048023 . .
## flavor . . 0.157245530
## product . . 0.046148874
## dvd . 0.04153220 .
You should see that the words that come out of the model match our intuition for what words would be associated with each product type.
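As a small follow-up, here is a sketch (using the beta object constructed above) that pulls out, for each category, the term with the largest positive coefficient; these should be among the clearest signal words for each product type:
beta_terms <- beta[rownames(beta) != "(Intercept)", , drop = FALSE]  # drop the intercept row
apply(beta_terms, 2, function(b) rownames(beta_terms)[which.max(b)]) # top term per category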