Today we make a shift towards a different type of analysis. So far we have mostly worked with a large collection of texts, with the goal of associating features with some response variable that we want to predict. Now we start to consider the case where we are interested in understanding specific documents in a corpus. Today's notes serve two purposes: they introduce some new methods and also introduce the data format for the third project.

Yelp Dataset

The third project works with a collection of reviews from the website Yelp. It is similar to the Amazon product reviews, but contains significantly more metadata and more authors. It is also a bit cleaner than the Amazon data. Each group has been given data from a different city. In the notes I will look at data from Toronto.

Let’s load the dataset into R and look at the full data table:

yelp <- read_csv("data/toronto.csv.gz")
token <- read_csv("data/toronto_token.csv.gz")

yelp
## # A tibble: 7,000 x 12
##    doc_id   train_id user_id   user_name gender gender_prob stars b_name     
##    <chr>    <chr>    <chr>     <chr>     <chr>        <dbl> <dbl> <chr>      
##  1 doc00001 valid    uid_0000… Kat       female           1     3 Other      
##  2 doc00002 valid    uid_0000… Kat       female           1     4 Ed's Real …
##  3 doc00003 valid    uid_0000… Kat       female           1     1 Burrito Bo…
##  4 doc00004 valid    uid_0000… Kat       female           1     4 Other      
##  5 doc00005 train    uid_0000… Kat       female           1     4 Other      
##  6 doc00006 valid    uid_0000… Kat       female           1     4 Other      
##  7 doc00007 valid    uid_0000… Kat       female           1     5 Other      
##  8 doc00008 valid    uid_0000… Kat       female           1     5 Other      
##  9 doc00009 train    uid_0000… Kat       female           1     3 Other      
## 10 doc00010 valid    uid_0000… Kat       female           1     3 Other      
## # … with 6,990 more rows, and 4 more variables: biz_category <chr>,
## #   lon <dbl>, lat <dbl>, text <chr>

You will see that, in addition to the user name and user id (note that these correspond one-to-one, so you can use either), there is also other information. For example, we have a predicted gender for the reviewer, the number of stars the review was given, the business name, the business category, and the latitude and longitude of the business. Note that the variable b_name collapses less common businesses into an “Other” category.
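
As a quick sanity check on these fields, here is a small sketch (assuming the same tidyverse verbs used throughout these notes) counting how many reviews received each star rating:

# count the number of reviews at each star rating
yelp %>%
  group_by(stars) %>%
  summarize(n_reviews = n()) %>%
  arrange(desc(n_reviews))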

The tokens data set contains the same columns as in our other datasets; I have also put several important variables directly in the tokens table to make them easy to work with.

token
## # A tibble: 1,405,488 x 14
##    doc_id   user_name gender biz_category b_name   sid   tid token 
##    <chr>    <chr>     <chr>  <chr>        <chr>  <dbl> <dbl> <chr> 
##  1 doc00001 Kat       female Other        Other      1     1 I     
##  2 doc00001 Kat       female Other        Other      1     2 have  
##  3 doc00001 Kat       female Other        Other      1     3 a     
##  4 doc00001 Kat       female Other        Other      1     4 good  
##  5 doc00001 Kat       female Other        Other      1     5 friend
##  6 doc00001 Kat       female Other        Other      1     6 who   
##  7 doc00001 Kat       female Other        Other      1     7 spends
##  8 doc00001 Kat       female Other        Other      1     8 $     
##  9 doc00001 Kat       female Other        Other      1     9 50    
## 10 doc00001 Kat       female Other        Other      1    10 a     
## # … with 1,405,478 more rows, and 6 more variables: token_with_ws <chr>,
## #   lemma <chr>, upos <chr>, xpos <chr>, tid_source <dbl>, relation <chr>
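
Because the metadata and annotation columns sit directly in the tokens table, quick summaries need no joins. For example, here is a small sketch (again, just tidyverse verbs) tabulating the ten most frequent noun lemmas in the corpus:

# ten most common noun lemmas across all reviews
token %>%
  filter(upos == "NOUN") %>%
  group_by(lemma) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  slice_head(n = 10)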

You may already have some ideas about what we might do with this dataset. Perhaps predicting the number of stars in the review? Or the user name? We could even predict the gender of the reviewer or the category of the business. And yes, you are welcome to do all of these things.

Today, however, I also want to show an entirely different set of techniques. Rather than treating these covariates as supervising variables that we need to predict, we will collapse all of the reviews that share a common identifier and use new techniques to understand the structure of each of these meta-documents.

In the notes, I will work with the business categories. For the project, you will also look at the other variables.

An Illustration

To illustrate the techniques that we are going to see, let's start with a quick example. We will call the function cnlp_utils_tf() using the variable biz_category in place of doc_id to create a term frequency matrix with just one row for each business category:

X <- token %>%
  cnlp_utils_tf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  )

dim(X)
## [1]    35 10000

Now, let’s plot two of the columns, the two corresponding to the terms ‘Asian’ and ‘beer’:

# NOTE: this is just for illustration; you probably do not want to use this code directly
tibble(
  biz_category = rownames(X),
  asian = as.numeric(X[,"Asian"]),
  beer = as.numeric(X[,"beer"])
) %>%
  filter(biz_category != "Other") %>%
  ggplot(aes(asian, beer)) +
    geom_point(color = "grey85") +
    geom_text_repel(aes(label = biz_category), max.overlaps = 40) +
    scale_x_continuous(limits = c(-1, NA)) +
    scale_y_continuous(limits = c(-1, NA))

Notice how, even with just these two terms, we can begin to see the relationships between each of the business types. Do you see some patterns here that follow your intuition about these businesses?

Of course, we will eventually want to understand the data using all of the columns of X. That’s where we need to learn a few new techniques.

TF-IDF

In this course we have done a lot of work with the term frequency (TF) matrix. As we move into unsupervised learning, we will see that it is important to scale the entries of this matrix to account for the overall frequency of terms across the corpus. To do this, we will use a TF-IDF (term frequency-inverse document frequency) matrix. Mathematically, if tf is the number of times a term is used in a document, df is the number of documents that use the term at least once, and N is the total number of documents, the TF-IDF score can be computed as:

\[ \text{tfidf} = (1 + \log_2(\text{tf})) \times \log_2(\text{N} / \text{df}) \]

The score measures how important a term is in describing a document in the context of the other documents. Note that this particular weighting is a popular choice, but it is not universal and other scaling functions are possible. We can create the TF-IDF matrix by replacing cnlp_utils_tf with cnlp_utils_tfidf.
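
To make the formula concrete, here is a minimal sketch that computes the score by hand for a single hypothetical term; the counts below are made up purely for illustration:

# hypothetical counts: a term used 8 times in one document and
# appearing in 5 of the 35 documents in the corpus
tf <- 8
df <- 5
N  <- 35

(1 + log2(tf)) * log2(N / df)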

In addition to its other uses, which we will see below, we can also use TF-IDF to measure the most important words in each document by finding the terms with the highest scores. For this particular task we will use the function sm_text_tfidf (it returns a data frame with three columns in a ‘long’ format, rather than a matrix). Let’s see how it works in this case:

token %>%
  sm_text_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  group_by(doc_id) %>%
  arrange(desc(tfidf)) %>%
  slice_head(n = 5) %>%
  summarize(tokens = paste(token, collapse = "; ")) %>%
  print.data.frame()
##                       doc_id                                       tokens
## 1               Asian Fusion      sizzle; sizzling; katsu; donburi; Katsu
## 2                   Bakeries   Tetsu; croissant; cupcake; macaron; raisin
## 3                    Buffets         buffet; Mandarin; Paneer; AYCE; ipad
## 4                    Burgers Priest; Chuck; cheeseburger; Burger; Burgers
## 5                      Cafes           cat; Bud; Passport; barista; Toast
## 6             Canadian (New)         Jack; calabrese; Kalbi; Sister; Gord
## 7                  Caribbean             jerk; roti; oxtail; Jerk; island
## 8                    Chinese        chive; Mein; congee; Shanghai; turnip
## 9                     Cinema      theatre; cinema; cineplex; theater; vip
## 10              Coffee & Tea  starbuck; Second; barista; americano; Latte
## 11              Comfort Food    Swiss; Chalet; housemade; cornbread; Earl
## 12                  Desserts macaron; pistachio; cupcake; durian; souffle
## 13                   Dim Sum             Dim; siu; Shanghai; cart; dumple
## 14                    Diners      hash; Sunset; Philly; Hollandaise; Mars
## 15                    French              frite; Le; escargot; Bass; foie
## 16                Gastropubs           pub; Ale; IPA; Specials; flatbread
## 17                     Greek       souvlaki; greek; tzatziki; Greek; feta
## 18                   Grocery            Mart; Frills; aisle; T&T; organic
## 19 Ice Cream & Frozen Yogurt     cone; Gelato; gelato; Jesus; marshmallow
## 20                    Indian          samosa; paneer; Paneer; Hakka; roti
## 21                   Italian gnocchi; pepperoni; veal; linguine; Pizzeria
## 22                  Japanese            katsu; karaage; izakaya; sushi; J
## 23                    Korean       bibimbap; Korean; kalbi; Korea; kimchi
## 24                   Mexican       burrito; Taco; mexican; Electric; Bell
## 25                   Noodles                  Pho; Sukho; Khao; pho; Deer
## 26                     Other         shower; aisle; desk; animal; macaron
## 27                 Pakistani           samosa; naan; Biryani; -the; India
## 28                      Pubs              pub; Pub; pool; Firkin; rooftop
## 29                     Ramen           Ramen; raman; Kinton; Guu; izakaya
## 30               Steakhouses            Keg; Moxies; filet; mignon; Creek
## 31                Sushi Bars   Sushi; Sashimi; teppanyaki; sashimi; aburi
## 32                 Taiwanese     taiwanese; tapioca; grass; stinky; jelly
## 33                      Thai                   Pad; Thai; thai; ayce; Tom
## 34                Vietnamese         Pho; pho; vietnamese; vermicelli; mi
## 35                 Wine Bars           Keg; veal; Hoof; Veal; charcuterie

And again, how well does this match your intuition?

Distances

One way to understand the structure of our data in the high-dimensional space of the TF-IDF matrix is to compute the distance between documents. We could do this with the usual Euclidean distance. However, that would be heavily biased by the number of words in each document. A better approach is to look at the angle between two documents; look back at the two-term example above to get some intuition for why.
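
As a minimal sketch of that intuition, take two made-up count vectors where the second is just the first document repeated three times: the Euclidean distance between them is large, but the angle between them is zero (a cosine of one):

# toy counts: doc2 uses the same words in the same proportions as doc1,
# it is just three times as long
doc1 <- c(2, 1, 0, 4)
doc2 <- doc1 * 3

# Euclidean distance: large, driven entirely by the length difference
sqrt(sum((doc1 - doc2)^2))

# cosine of the angle between the two vectors: exactly 1 (angle of zero)
sum(doc1 * doc2) / (sqrt(sum(doc1^2)) * sqrt(sum(doc2^2)))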

We can compute angle distances with the function sm_tidy_angle_distance():

token %>%
  cnlp_utils_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  sm_tidy_angle_distance()
## # A tibble: 1,225 x 3
##    document1                 document2 distance
##    <chr>                     <chr>        <dbl>
##  1 Other                     Other        0    
##  2 Ice Cream & Frozen Yogurt Other        0.422
##  3 Diners                    Other        0.399
##  4 Bakeries                  Other        0.394
##  5 Wine Bars                 Other        0.385
##  6 Caribbean                 Other        0.451
##  7 Pubs                      Other        0.397
##  8 French                    Other        0.401
##  9 Grocery                   Other        0.377
## 10 Italian                   Other        0.379
## # … with 1,215 more rows

After removing self-similarities, let’s see what the closest neighbor is to each document:

token %>%
  cnlp_utils_tfidf(doc_var = "biz_category", token_var = "lemma") %>%
  sm_tidy_angle_distance() %>%
  filter(document1 < document2) %>%
  group_by(document1) %>%
  arrange(distance) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  arrange(distance)
## # A tibble: 34 x 3
##    document1      document2    distance
##    <chr>          <chr>           <dbl>
##  1 Japanese       Sushi Bars      0.266
##  2 Indian         Pakistani       0.267
##  3 Cafes          Coffee & Tea    0.307
##  4 Noodles        Vietnamese      0.317
##  5 Canadian (New) Other           0.322
##  6 Comfort Food   Other           0.353
##  7 Bakeries       Desserts        0.360
##  8 Italian        Wine Bars       0.364
##  9 Gastropubs     Pubs            0.365
## 10 Chinese        Dim Sum         0.366
## # … with 24 more rows

Do you see some relationships that suggest this approach is reasonable?

Principal component analysis (PCA)

Principal component analysis is a common method for taking a high-dimensional data set and converting it into a smaller set of dimensions that capture many of the most interesting aspects of the higher-dimensional space. The first principal component is defined as the direction in the high-dimensional space that captures the most variation in the inputs. The second component is the direction perpendicular to the first that captures the largest amount of the residual variance. Additional components are defined similarly.

If you prefer a mathematical definition, define the following vector (called the first loading vector) in terms of the data matrix X:

\[ W_1 = \text{argmax}_{\; v : || v ||_2 = 1} \left\{ || X v ||_2 \right\} \]

Then, the first principal component is given by:

\[ Z_1 = X \cdot W_1 \]

The second loading vector (W2) is defined just as W1, but with the argmax taken over all unit vectors perpendicular to W1. And so on.
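
If it helps to connect this definition to standard linear algebra, here is a small sketch (not the helper function we use below) that recovers the loading vectors of a random matrix from its singular value decomposition; the right singular vectors are the W's above, and multiplying X by them gives the principal components:

# sketch: loading vectors via the SVD, following the definition above
# (no centering or scaling, to match the formula exactly)
set.seed(1)
X <- matrix(rnorm(50), ncol = 5)

W <- svd(X)$v         # columns are the loading vectors W1, W2, ...
Z <- X %*% W[, 1:2]   # the first two principal components Z1 and Z2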

We can compute principal components using the helper function sm_tidy_pca. We will grab the first 4 components here:

token %>%
  cnlp_utils_tfidf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  ) %>%
  sm_tidy_pca(n = 4)
## # A tibble: 35 x 5
##    document                      v1      v2      v3      v4
##    <chr>                      <dbl>   <dbl>   <dbl>   <dbl>
##  1 Other                     0.290  -0.352  -0.0281  0.0218
##  2 Ice Cream & Frozen Yogurt 0.0510 -0.121  -0.126  -0.114 
##  3 Diners                    0.0917 -0.169   0.0155  0.111 
##  4 Bakeries                  0.0752 -0.210  -0.262  -0.238 
##  5 Wine Bars                 0.113  -0.198   0.135   0.0640
##  6 Caribbean                 0.0241 -0.0189  0.101  -0.0321
##  7 Pubs                      0.0648 -0.223   0.137   0.188 
##  8 French                    0.0870 -0.161   0.0404  0.0353
##  9 Grocery                   0.0790 -0.113  -0.0791 -0.130 
## 10 Italian                   0.0969 -0.190   0.0687  0.0703
## # … with 25 more rows

What can we do with these components? Well, for one thing, we can plot the first two components to show the relationship between our documents within the high dimensional space:

token %>%
  cnlp_utils_tfidf(
    doc_set = unique(yelp$biz_category),
    min_df = 0.001,
    max_df = 1.0,
    max_features = 10000,
    doc_var = "biz_category",
    token_var = "token"
  ) %>%
  sm_tidy_pca() %>%
  ggplot(aes(x = v1, y = v2)) +
    geom_point(color = "grey90") +
    geom_text_repel(
      aes(label = document),
      show.legend = FALSE
    ) +
    theme_void()