Today we make a shift towards a different type of analysis. So far we have mostly been interested in large collections of texts, with the goal of associating features with some response variable we want to predict. Now, we start to consider the case where we are interested in understanding specific documents in a corpus. Today's notes serve two purposes: introducing some new methods and introducing the data format for the third project.
The third project works with a collection of reviews from the website Yelp. It is similar to the Amazon product reviews, but contains significantly more metadata and more authors. It is also a bit cleaner than the Amazon data. Each group has been given data from a different city. In the notes I will look at data from Toronto.
Let’s load the dataset into R and look at the full data table:
<- read_csv("data/toronto.csv.gz")
yelp <- read_csv("data/toronto_token.csv.gz")
token
yelp
## # A tibble: 7,000 x 12
## doc_id train_id user_id user_name gender gender_prob stars b_name
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 doc00001 valid uid_0000… Kat female 1 3 Other
## 2 doc00002 valid uid_0000… Kat female 1 4 Ed's Real …
## 3 doc00003 valid uid_0000… Kat female 1 1 Burrito Bo…
## 4 doc00004 valid uid_0000… Kat female 1 4 Other
## 5 doc00005 train uid_0000… Kat female 1 4 Other
## 6 doc00006 valid uid_0000… Kat female 1 4 Other
## 7 doc00007 valid uid_0000… Kat female 1 5 Other
## 8 doc00008 valid uid_0000… Kat female 1 5 Other
## 9 doc00009 train uid_0000… Kat female 1 3 Other
## 10 doc00010 valid uid_0000… Kat female 1 3 Other
## # … with 6,990 more rows, and 4 more variables: biz_category <chr>,
## # lon <dbl>, lat <dbl>, text <chr>
You will see that, in addition to the user name and user id (note, these are uniquely defined; you can use either one), there is also other information. For example, we have a predicted gender for the reviewer, the number of stars the review was given, the business name, the business category, and the latitude and longitude of the business. Note that the variable b_name collapses less common businesses into an "Other" category.
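Before doing anything fancy, it can help to see how much data falls into each of these categories. A quick count (nothing project-specific here, just a standard dplyr verb) shows how the reviews are distributed across the business names, including the collapsed "Other" bucket:

yelp %>%
  count(b_name, sort = TRUE)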
The tokens data set contains the same columns as in our other datasets; I have also put several important variables directly in the tokens table to make them easy to work with.
token
## # A tibble: 1,405,488 x 14
## doc_id user_name gender biz_category b_name sid tid token
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 doc00001 Kat female Other Other 1 1 I
## 2 doc00001 Kat female Other Other 1 2 have
## 3 doc00001 Kat female Other Other 1 3 a
## 4 doc00001 Kat female Other Other 1 4 good
## 5 doc00001 Kat female Other Other 1 5 friend
## 6 doc00001 Kat female Other Other 1 6 who
## 7 doc00001 Kat female Other Other 1 7 spends
## 8 doc00001 Kat female Other Other 1 8 $
## 9 doc00001 Kat female Other Other 1 9 50
## 10 doc00001 Kat female Other Other 1 10 a
## # … with 1,405,478 more rows, and 6 more variables: token_with_ws <chr>,
## # lemma <chr>, upos <chr>, xpos <chr>, tid_source <dbl>, relation <chr>
You may already have some ideas about what we might do with this dataset. Perhaps predicting the number of stars in the review? Or the user name? We could even predict the gender of the reviewer or the category of the business. And yes, you are welcome to do all of these things.
Today, however, I also want to show an entirely different set of techniques. Rather than treating these covariates as supervising variables that we need to predict, we will collapse all of the reviews that share a common identifier and use techniques to understand the structure of each of these meta-documents. So, to summarize: for us, a "document" today will be the combined text of all reviews that share a value of one of these metadata variables.
In the notes, I will work with the business categories. For the project, you will also look at the other variables.
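Conceptually, building these meta-documents amounts to pasting together the text of every review that shares a value of the chosen variable. We will not need to do this by hand, because the matrix helpers below accept a doc_var argument that does the grouping for us, but a rough dplyr sketch of the idea looks like this:

# Illustrative only: collapse all reviews within a business category into
# one long meta-document (the functions below do this for us via doc_var)
meta_docs <- yelp %>%
  group_by(biz_category) %>%
  summarize(text = paste(text, collapse = " "))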
To illustrate the techniques that we are going to see, let's start with a quick example. We will call the function cnlp_utils_tf() using the variable biz_category in place of doc_id to create a matrix with just one row for each business category:
X <- token %>%
  cnlp_utils_tf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  )
dim(X)
## [1] 35 10000
Now, let’s plot two of the columns, the two corresponding to the terms ‘Asian’ and ‘beer’:
# NOTE: this is just for illustration; you probably do not want to use it directly
tibble(
  biz_category = rownames(X),
  asian = as.numeric(X[, "Asian"]),
  beer = as.numeric(X[, "beer"])
) %>%
  filter(biz_category != "Other") %>%
  ggplot(aes(asian, beer)) +
    geom_point(color = "grey85") +
    geom_text_repel(aes(label = biz_category), max.overlaps = 40) +
    scale_x_continuous(limits = c(-1, NA)) +
    scale_y_continuous(limits = c(-1, NA))
Notice how, even with just these two terms, we can begin to see the relationships between each of the business types. Do you see some patterns here that follow your intuition about these businesses?
Of course, we will eventually want to understand the data using all of the columns of X. That’s where we need to learn a few new techniques.
In this course we have done a lot of work using the term frequency (TF) matrix. As we move into unsupervised learning, we will see that it is important to modify this object by scaling the entries to account for the overall frequency of terms across the corpus. To do this, we will use a TF-IDF (term frequency-inverse document frequency) matrix. Mathematically, if tf is the number of times a term is used in a document, df is the number of documents that use the term at least once, and N is the total number of documents, the TF-IDF score can be computed as:
\[ \text{tfidf} = \left(1 + \log_2(\text{tf})\right) \times \log_2\left(N / \text{df}\right) \]
The score gives a measurement of how important a term is in describing a document in the context of the other documents. Note that these are popular choices for the scaling functions, but they are not universal and other choices are possible. We can create the TF-IDF matrix by replacing the normal function with cnlp_utils_tfidf().
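To make the formula concrete, here is the score computed by hand for a hypothetical term; the numbers are invented purely for illustration.

# Suppose a term appears 8 times in a document (tf = 8) and shows up in
# 5 of the N = 35 category documents (df = 5)
tf <- 8
df <- 5
N  <- 35
(1 + log2(tf)) * log2(N / df)

## [1] 11.22942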
In addition to its other uses, which we will see below, we can also use TF-IDF to try to measure the most important words in each document by finding the terms that have the highest scores. For this one particular case, we will use the function sm_text_tfidf() (it returns a data frame with three columns in a 'long' format, rather than a matrix). Let's see how it works in this case:
token %>%
  sm_text_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  group_by(doc_id) %>%
  arrange(desc(tfidf)) %>%
  slice_head(n = 5) %>%
  summarize(tokens = paste(token, collapse = "; ")) %>%
  print.data.frame()
## doc_id tokens
## 1 Asian Fusion sizzle; sizzling; katsu; donburi; Katsu
## 2 Bakeries Tetsu; croissant; cupcake; macaron; raisin
## 3 Buffets buffet; Mandarin; Paneer; AYCE; ipad
## 4 Burgers Priest; Chuck; cheeseburger; Burger; Burgers
## 5 Cafes cat; Bud; Passport; barista; Toast
## 6 Canadian (New) Jack; calabrese; Kalbi; Sister; Gord
## 7 Caribbean jerk; roti; oxtail; Jerk; island
## 8 Chinese chive; Mein; congee; Shanghai; turnip
## 9 Cinema theatre; cinema; cineplex; theater; vip
## 10 Coffee & Tea starbuck; Second; barista; americano; Latte
## 11 Comfort Food Swiss; Chalet; housemade; cornbread; Earl
## 12 Desserts macaron; pistachio; cupcake; durian; souffle
## 13 Dim Sum Dim; siu; Shanghai; cart; dumple
## 14 Diners hash; Sunset; Philly; Hollandaise; Mars
## 15 French frite; Le; escargot; Bass; foie
## 16 Gastropubs pub; Ale; IPA; Specials; flatbread
## 17 Greek souvlaki; greek; tzatziki; Greek; feta
## 18 Grocery Mart; Frills; aisle; T&T; organic
## 19 Ice Cream & Frozen Yogurt cone; Gelato; gelato; Jesus; marshmallow
## 20 Indian samosa; paneer; Paneer; Hakka; roti
## 21 Italian gnocchi; pepperoni; veal; linguine; Pizzeria
## 22 Japanese katsu; karaage; izakaya; sushi; J
## 23 Korean bibimbap; Korean; kalbi; Korea; kimchi
## 24 Mexican burrito; Taco; mexican; Electric; Bell
## 25 Noodles Pho; Sukho; Khao; pho; Deer
## 26 Other shower; aisle; desk; animal; macaron
## 27 Pakistani samosa; naan; Biryani; -the; India
## 28 Pubs pub; Pub; pool; Firkin; rooftop
## 29 Ramen Ramen; raman; Kinton; Guu; izakaya
## 30 Steakhouses Keg; Moxies; filet; mignon; Creek
## 31 Sushi Bars Sushi; Sashimi; teppanyaki; sashimi; aburi
## 32 Taiwanese taiwanese; tapioca; grass; stinky; jelly
## 33 Thai Pad; Thai; thai; ayce; Tom
## 34 Vietnamese Pho; pho; vietnamese; vermicelli; mi
## 35 Wine Bars Keg; veal; Hoof; Veal; charcuterie
And again, how well does this match your intuition?
One way to understand the structure of our data in the high-dimensional space of the TF-IDF matrix is to compute the distance between documents. We could do this with the typical Euclidean distance. However, this would be heavily biased by the number of words in each document. A better approach is to look at the angle between two documents; look back at the two-term example above to get some intuition for why.
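To see why the angle is insensitive to document length, consider a tiny hand computation (illustrative only; the exact scaling that sm_tidy_angle_distance() applies to the raw angle may differ from the radians shown here):

# Toy documents described by counts of two terms; doc_b uses the terms in
# exactly the same proportions as doc_a but is twice as long
doc_a <- c(asian = 2, beer = 10)
doc_b <- c(asian = 4, beer = 20)
doc_c <- c(asian = 9, beer = 1)

# angle between two vectors, in radians
angle <- function(x, y) {
  cos_sim <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))
  acos(pmin(pmax(cos_sim, -1), 1))  # clamp to avoid floating-point issues
}

angle(doc_a, doc_b)           # 0: same direction, so length alone does not matter
angle(doc_a, doc_c)           # about 1.26 radians: very different usage of the terms
sqrt(sum((doc_a - doc_b)^2))  # Euclidean distance is large despite identical proportions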
We can compute these angle distances with the function sm_tidy_angle_distance():
token %>%
  cnlp_utils_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  sm_tidy_angle_distance()
## # A tibble: 1,225 x 3
## document1 document2 distance
## <chr> <chr> <dbl>
## 1 Other Other 0
## 2 Ice Cream & Frozen Yogurt Other 0.422
## 3 Diners Other 0.399
## 4 Bakeries Other 0.394
## 5 Wine Bars Other 0.385
## 6 Caribbean Other 0.451
## 7 Pubs Other 0.397
## 8 French Other 0.401
## 9 Grocery Other 0.377
## 10 Italian Other 0.379
## # … with 1,215 more rows
After removing self-similarities, let’s see what the closest neighbor is to each document:
token %>%
  cnlp_utils_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  sm_tidy_angle_distance() %>%
  filter(document1 < document2) %>%
  group_by(document1) %>%
  arrange(distance) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  arrange(distance)
## # A tibble: 34 x 3
## document1 document2 distance
## <chr> <chr> <dbl>
## 1 Japanese Sushi Bars 0.266
## 2 Indian Pakistani 0.267
## 3 Cafes Coffee & Tea 0.307
## 4 Noodles Vietnamese 0.317
## 5 Canadian (New) Other 0.322
## 6 Comfort Food Other 0.353
## 7 Bakeries Desserts 0.360
## 8 Italian Wine Bars 0.364
## 9 Gastropubs Pubs 0.365
## 10 Chinese Dim Sum 0.366
## # … with 24 more rows
Do you see some relationships that suggest this approach is reasonable?
Principal component analysis (PCA) is a common method for taking a high-dimensional data set and converting it into a smaller set of dimensions that capture many of the most interesting aspects of the higher-dimensional space. The first principal component is defined as the direction in the high-dimensional space that captures the most variation in the inputs. The second component is a dimension perpendicular to the first that captures the highest amount of residual variance. Additional components are defined similarly.
If you prefer a mathematical definition, define the following vector (called the loading vector) of the data matrix X:
\[ W_1 = \text{argmax}_{\; v : || v ||_2 = 1} \left\{ || X v ||_2 \right\} \]
Then, the first principal component is given by:
\[ Z_1 = X \cdot W_1 \] The second loading vector (W2) is defined just as W1, but with the argmax taken over all unit vectors perpendicular to W1. And so on.
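If you want to see this definition in action on something small, here is an illustrative check using base R's prcomp() on random data (the helper sm_tidy_pca() used below handles the equivalent computation for our TF-IDF matrices, possibly with its own scaling choices): the first column of the rotation matrix is the loading vector W1, and multiplying the centered data by it reproduces the first component.

# Purely illustrative: check that the first principal component equals the
# centered data matrix times the first loading vector
set.seed(1)
X_small <- matrix(rnorm(30 * 5), ncol = 5)

pca <- prcomp(X_small)                 # centers the columns by default
W1  <- pca$rotation[, 1]               # first loading vector
Z1  <- scale(X_small, center = TRUE, scale = FALSE) %*% W1

max(abs(Z1 - pca$x[, 1]))              # effectively zero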
We can compute principal components using the helper function sm_tidy_pca(). We will grab the first 4 components here:
token %>%
  cnlp_utils_tfidf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  ) %>%
  sm_tidy_pca(n = 4)
## # A tibble: 35 x 5
## document v1 v2 v3 v4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Other 0.290 -0.352 -0.0281 0.0218
## 2 Ice Cream & Frozen Yogurt 0.0510 -0.121 -0.126 -0.114
## 3 Diners 0.0917 -0.169 0.0155 0.111
## 4 Bakeries 0.0752 -0.210 -0.262 -0.238
## 5 Wine Bars 0.113 -0.198 0.135 0.0640
## 6 Caribbean 0.0241 -0.0189 0.101 -0.0321
## 7 Pubs 0.0648 -0.223 0.137 0.188
## 8 French 0.0870 -0.161 0.0404 0.0353
## 9 Grocery 0.0790 -0.113 -0.0791 -0.130
## 10 Italian 0.0969 -0.190 0.0687 0.0703
## # … with 25 more rows
What can we do with these components? Well, for one thing, we can plot the first two components to show the relationship between our documents within the high dimensional space:
token %>%
  cnlp_utils_tf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  ) %>%
  sm_tidy_pca() %>%
  ggplot(aes(x = v1, y = v2)) +
    geom_point(color = "grey90") +
    geom_text_repel(
      aes(label = document),
      show.legend = FALSE
    ) +
    theme_void()
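As a small follow-up you might try on your own data (same pipeline as above, just different components), plotting the third and fourth dimensions can reveal groupings that the first two components hide:

token %>%
  cnlp_utils_tf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  ) %>%
  sm_tidy_pca(n = 4) %>%
  ggplot(aes(x = v3, y = v4)) +
    geom_point(color = "grey90") +
    geom_text_repel(aes(label = document), show.legend = FALSE) +
    theme_void()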