Load the Data

For today’s notes, I want to illustrate some ways that we can work with the kinds of data you are looking at for your projects. Specifically, I will look at the Yelp reviews from Montréal (these have not been assigned to any group):

docs <- read_csv("../data/revs_montreal.csv.bz2")
anno <- read_csv("../data/revs_montreal_token.csv.bz2")
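
Before going further, it helps to glance at what these two tables contain. As a quick sketch, here are a few of the columns from docs that the rest of these notes rely on (the token table anno has at least a doc_id and a lemma column, which are the only ones I will reference directly below):

docs %>%
  select(doc_id, user_name, biz_category) %>%   # columns used later in these notes
  slice_head(n = 3)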

I’ll start by looking at a PCA plot of the reviews, with one point per reviewer, along with a default number of clusters:

anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "user_name", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 5) %>%
  ggplot(aes(v1, v2, color = factor(cluster))) +
    geom_point() +
    theme_void()

Notice that the data cluster very clearly into two distinct groups. Let’s redo the analysis with just two clusters. I will be using the results in a few different ways, so I save versions of the docs and anno tables here.

# cluster reviewers in the two-dimensional PCA space; the two mutate() calls
# below set up the id columns expected by the dsst functions used later
docs_cluster <- anno %>%
  inner_join(select(docs, -text), by = "doc_id") %>%
  dsst_pca(doc_var = "user_name", n_dims = 2) %>%
  dsst_kmeans(n_clusters = 2) %>%
  mutate(train_id = "train") %>%
  mutate(doc_id = user_name)

# attach each reviewer's cluster back onto the token-level table
anno_cluster <- anno %>%
  inner_join(select(docs, doc_id, user_name), by = "doc_id") %>%
  inner_join(docs_cluster, by = "user_name", suffix = c("_orig", ""))

Plotting the results of the clustering shows the clear distinction between the two groups:

docs_cluster %>%
  ggplot(aes(v1, v2, color = factor(cluster))) +
    geom_point() +
    theme_void()

We can use the dsst_metrics function to find the terms most associated with each cluster, keeping only terms that occur more often than would be expected by chance:

dsst_metrics(anno_cluster, docs_cluster, label_var = "cluster") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 25) %>%
  summarize(token = paste0(token, collapse = " ")) %>%
  getElement("token")
## [1] "be the and I to of it have a in for that with this but we they my you n't so not good do as"
## [2] "de et le la un à pour les des que une en pas est qui du je dan avec vous au ce y il c'est"

As you might have guessed, the clusters are related to the language of the review, either English or French. Now, let’s try to label the language of a review. We could use a fancy model, but I think it’s interesting to try to do this with just the data manipulation tools we have access to. To do this, I will save the top-25 words for each cluster:

lang_df <- dsst_metrics(anno_cluster, docs_cluster, label_var = "cluster") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 25) %>%
  ungroup() %>%
  mutate(lang = if_else(label == 1, "English", "Français")) %>%
  select(lemma = token, lang)

lang_df
## # A tibble: 50 × 2
##    lemma lang   
##    <chr> <chr>  
##  1 be    English
##  2 the   English
##  3 and   English
##  4 I     English
##  5 to    English
##  6 of    English
##  7 it    English
##  8 have  English
##  9 a     English
## 10 in    English
## # … with 40 more rows
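
As an aside, the fancier route mentioned above could be as simple as an off-the-shelf language detector. Here is a minimal sketch, assuming the cld3 package (Google's Compact Language Detector, not part of our usual setup) is installed; it labels each text with a short language code such as "en" or "fr":

library(cld3)   # not part of our usual setup; install.packages("cld3") if needed

docs %>%
  mutate(lang_guess = detect_language(text)) %>%   # returns codes such as "en", "fr"
  count(lang_guess)

I will stick with the hand-rolled approach here, though, since it only uses verbs we already know.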

Now, we can do an inner join to the annotation data and count how many terms from each of these two sets appear in each document. Here, we visualize, for each document, the proportion of its matched terms that come from the French set:

anno %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  ggplot(aes(percent_fr)) +
    geom_histogram(bins = 50, color = "black", fill = "white")

We see that most documents are almost entirely in either English or French, though many have a few words from the other set. I looked at these, and they are mostly either the names of businesses or, particularly in the French reviews, short English expressions inserted into the text.
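
If you want to poke at these yourself, here is one way to do it; this is only a sketch, and the 0.9 cutoff is an arbitrary choice of mine. It pulls out the English marker words that occur inside reviews that are otherwise almost entirely French:

anno %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  mutate(percent_fr = mean(lang == "Français")) %>%
  ungroup() %>%
  filter(percent_fr > 0.9, lang == "English") %>%   # the 0.9 cutoff is arbitrary
  count(lemma, sort = TRUE)

Flipping the filter around shows the French marker words that sneak into mostly-English reviews.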

There is a small number of reviews in the middle of the histogram above (it may be hard to see in the compiled notes, but I could tell in RStudio). Let’s take a look at a few of these:

set.seed(1L) # to make sure I always take the same set

anno %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  filter(between(percent_fr, 0.4, 0.6)) %>%
  inner_join(docs, by = "doc_id") %>%
  slice_sample(n = 4) %>%
  dsst_print_text(max_chars = 10000)
## doc03216; 0.410714285714286; valid; uid_0046716; Hugo; male; 0.9987;
## 4; Other; Other; -73.4674748; 45.4741756
## 20 janvier 2013 Par un dimanche glacé, une fringale de smoked meat me
## transporte chez Chenoy's, une institution à Brossard depuis 20 ans,
## cette année. L'accueil se fait par le doyen de la place, un petit
## monsieur tout joyeux et heureux de vous servir. Le service s'est fait
## par la charmante France. Quoi choisir ? trop de choix. Le classique
## spécial Chenoy's, rien de mieux. Un club sandwich tout en smoked
## meat. Pas besoin d'entrée, vous serez rassasié ! Essayez aussi la
## poutine au smoked meat. Tant qu'à manger du smoked meat, faites-en
## une soirée ;-) January 20th 2013 On an icy sunday, a snack attack for
## smoked meat leads me to Chenoy's, a Brossard institution for 20 years
## now. Greatings at the entrance from the wisest man of the place, very
## happy to serve you. The waiting at the table was made by the always
## charming France. What to choose from the menu ? too many choices. The
## classic special Chenoy's is a must. A nice smoked meat club sandwich.
## No need for apetizers, you 'll be full by the end of the plate ! Try
## the smoked meat poutine. Why not eat all smoked meat since you're
## there ;-)
## 
## doc03182; 0.434782608695652; train; uid_0046716; Hugo; male; 0.9987;
## 1; McDonald's; Burgers; -73.465759; 45.456398
## Samedi 28 septembre 2013 McDonald's nous montre enfin son incapacité à
## être efficace. Les fins de semaines sont de plus en plus pénibles chez
## McDonalds avec parfois 15 à 20 minutes d'attente avant d'être servi.
## Pour un endroit qui se dit Fast Food, on peut laisser tomber. Saturday
## september 28th 2013 McDonald's is showing us how low the efficiency
## level is going. Every week-ends is worst then the other, you can wait
## up to 15 to 20 minutes before being served. They call it Fast Food,
## NOPE ! Forget it.
## 
## doc00006; 0.440298507462687; train; uid_0000926; Mathieu; male; 1; 4;
## Leméac; French; -73.5972169; 45.5182769
## Version Française à la fin: Second time was successful. I was a bit
## worry about it because my friend was disapointed of his second time
## there. Usually second time feels confortable and not as suprising.
## I think that's the only reason I didn't put a 5 stars on that one.
## I went with my friend and orderd a 2 courses only. I knew my friend
## was really hungry and I was just a little. My first time there I had
## some Rillette Maison which was fabulous but not a real apetizers. It
## was more a communal size side-course. The only thing that gave you it
## was a 1 person apetizer was the 3 small wheat biscotte. So my friend
## also took the Onglet with homemade fries. Onglet is part of the bib,
## it's slightly - even - more tasteful than bib. He was really happy
## with that and told me it was great! I had some Boudin maison with a
## small reduction of apple cider vinegar with as a main course ris de
## veau and loin steak with celeriac purée and seared wild mushrooms .
## The Celeriac purée was one of the best I had tasted in my life. Not
## that I tend to go for it but It's nice when it well done. The cooking
## was great and the size as well. The taste and seasoning were great but
## the textures for my main course was lacking a crunchy texture. I don't
## know if mixing Ris de veau and Loin Steak was the best idea in the
## world but it was certainly interesting. Maybe if the Ris de veau was
## presented in a donut or a baluchon. It's French Bistro so if you don't
## know what a French bistro is, well it's french cuisine classic. Little
## downside for me, there's definetly a Bourgeois feeling to the place,
## like at Moishe. C'était ma deuxième fois au Leméac. Les premières fois
## sont toujours les plus surprenantes et les secondes fois ont un côté
## confortable qui se distingue principalement par un goût de revenez-
## y. Je crois que j'ai p-e diminué mon score sur Leméac mais c'était
## encore très bon. Mon ami a prit ce que j'avais pris la première fois
## que je n'y étais rendu. Il avait pas mal faim et moi pas trop. Il
## a prit l'Onglet de boeuf avec pomme-frite maison (qui sont vraiment
## bonne! (mince julienne, blanchies avant!) Moi j'ai pris le Boudin
## maison avec une purée de pomme de terre, garniture de pommes servit
## avec une réduction de vinaigre de cidre. Avec comme met principale
## le ris de veau avec un contre-filet de bœuf avec champignons poêlés
## et purée de céleri-rave. Hmm une des meilleures purées de Céleri-rave
## que j'ai mangé a vie. Le Céleri-rave est trop peu connu et pas assez
## utilisé dans la cuisine québécoise! C'est délicieux quand c'est bien
## apprêté et ça se prête bien à des réductions . Petit bémol sur la
## texture qui était vraiment un hommage au «mou» . Pas de croquant dans
## mon assiette et je crois que c'est ce qui manquait au plat pour le
## rendre bon. Y se donne le nom de restaurant et oui, c'est une coche ou
## deux au dessus d'un bistro mais c'est au bout du compte, de son décor
## jusqu'aux plats présenté un bistro qui nous présente des classiques de
## la cuisine française avec une petite touche plus poussé. On y retrouve
## définitivement un petit feeling bourgeois qui me rappelle un peu le
## Moishe pas nécessairement mon choix de «foule» .Je suis un peu déçu
## qu'on ne trouve pas une Table d'hôte avant le fameux 24$ - ridicule de
## le garder a 22$ quand a 24$ c'est toujours le meilleur deal en ville
## - après 22h! Quelle bonne façon de poussée une prep qui pourrait être
## moins fraiche en fin de journée . Belle façon de terminé la soirée
## aussi. Fin positive: de l'onglet sur le menu! Quelle belle façon de
## faire découvrir une partie génialissime du boeuf!
## 
## doc03176; 0.478260869565217; valid; uid_0046716; Hugo; male; 0.9987;
## 3; Harvey's; Burgers; -73.467285; 45.471041
## 30 août 2012, 17h30, une petite fringale en passant devant Harvey's
## en faisant mes courses. C'est parfait, je les connais et je les aime
## beaucoup depuis que je suis tout petit. Je vois le nouveau spécial :
## Harvey's Deluxe au Porc Effiloché en combo. Excellent service,
## garniture au goût. Une fois rendu à ma place j'ouvre le hamburger
## pour me rendre compte que le porc effiloché, il n'y en a QU'UNE
## SEULE CUILLÈRÉE À SOUPE ! En fait ce n'est qu'un simple hamburger
## au boeuf ... oui oui ... un simple hamburger de boeuf "garni" de
## porc effiloché pour 7,xx $. Les frites étaient à peine cuites une
## deuxième fois donc plutôt molles et très blanches ! Harvey's VOUS
## VOUS RAMOLLISSEZ ! August 30th 2012, 17h30, a little hungry passing
## by Harvey's while doing some errands. Perfect, I know them since my
## childhood, they are great. I see the new special : Harvey's Deluxe
## with Shredded Porc in combo. Excellent service, toppings at my own
## liking. Once at my place, I open-up the hamburger and realize that it
## has ONLY ONE SPOONFUL OF PORC ! In fact, it's an ordinary beef burger
## "garnished" with shredded porc for 7,xx $. Fries where almost fried
## a second time (you ALWAYS fry twice to get a nice color and a nice
## taste), still very white and mushy ! Harvey's YOU ARE GETTING VERY
## LAZZZZZY !

You can see that these are reviews where the author wrote the text in one language and then translated it into the other.
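
For reference, we can also count exactly how many reviews fall into this mixed band. This is a quick sketch that reuses the 0.4–0.6 cutoffs from the filter above:

anno %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  summarize(n_mixed = sum(between(percent_fr, 0.4, 0.6)))   # same band as the filter above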

Let’s see how the language of the reviews corresponds to the individual reviewers. I will classify each review as English, French, or mixed based on its share of French marker words, and then compute the proportion of each type for every reviewer:

lang_by_user <- anno %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  mutate(lang = if_else(percent_fr > 0.75, "Français", "Mixed")) %>%
  mutate(lang = if_else(percent_fr < 0.25, "English", lang)) %>%
  inner_join(docs, by = "doc_id") %>%
  group_by(user_name) %>%
  summarize(percent_fr = mean(lang == "Français"),
            percent_mix = mean(lang == "Mixed"),
            percent_en = mean(lang == "English"))

To start, let’s see how this distribution compares to the document-level one:

lang_by_user %>%
  ggplot(aes(percent_fr)) +
    geom_histogram(bins = 50, color = "black", fill = "white")

And who are the reviewers that use a mixture of the two languages? Let’s sort by the mixed proportion:

lang_by_user %>%
  arrange(desc(percent_mix))
## # A tibble: 100 × 4
##    user_name percent_fr percent_mix percent_en
##    <chr>          <dbl>       <dbl>      <dbl>
##  1 Hugo          0.0143      0.986      0     
##  2 Mathieu       0.443       0.2        0.357 
##  3 Natalie       0.956       0.0441     0     
##  4 Julien 2      0.957       0.0429     0     
##  5 Judith        0.986       0.0143     0     
##  6 Matthew       0           0.0143     0.986 
##  7 Sarah 2       0.971       0.0143     0.0143
##  8 Simon         0.986       0.0143     0     
##  9 Aimee         0           0          1     
## 10 Alex          0           0          1     
## # … with 90 more rows

Next, let’s see how the language of the bilingual reviews from Hugo and Mathieu corresponds to each half of the review.

anno %>%
  group_by(doc_id) %>%
  mutate(half = if_else(row_number() > n() * 0.5, "2nd", "1st")) %>%
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  mutate(percent_fr = mean(lang == "Français")) %>%
  filter(between(percent_fr, 0.3, .7)) %>%
  group_by(half, doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  inner_join(docs, by = "doc_id") %>%
  filter(user_name %in% c("Hugo", "Mathieu")) %>%
  group_by(half, user_name) %>%
  summarize(percent_fr = mean(percent_fr))
## # A tibble: 4 × 3
## # Groups:   half [2]
##   half  user_name percent_fr
##   <chr> <chr>          <dbl>
## 1 1st   Hugo          0.965 
## 2 1st   Mathieu       0.380 
## 3 2nd   Hugo          0.0274
## 4 2nd   Mathieu       0.626

We see that Hugo almost always starts in French and then follows with an English translation, which matches the examples printed above. Mathieu is more likely to start in English and then switch into French, but is less consistent.
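
If you want a finer-grained picture than halves, the same computation extends directly to quarters of each review. This is just a sketch built from the verbs used above, with the tokens binned into four equal parts rather than two:

anno %>%
  group_by(doc_id) %>%
  mutate(quarter = ceiling(row_number() / n() * 4)) %>%   # bin tokens into four equal parts
  inner_join(lang_df, by = "lemma") %>%
  group_by(doc_id) %>%
  mutate(percent_fr = mean(lang == "Français")) %>%
  filter(between(percent_fr, 0.3, 0.7)) %>%   # same mixed-review band as above
  group_by(quarter, doc_id) %>%
  summarize(percent_fr = mean(lang == "Français")) %>%
  inner_join(docs, by = "doc_id") %>%
  filter(user_name %in% c("Hugo", "Mathieu")) %>%
  group_by(quarter, user_name) %>%
  summarize(percent_fr = mean(percent_fr))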

Now, what types of businesses are most associated with each language? Let’s check the French reviews first:

lang_by_user %>%
  inner_join(docs, by = "user_name") %>%
  mutate(lang = if_else(percent_fr > 0.5, "Français", "English")) %>%
  group_by(biz_category) %>%
  summarize(avg_fr = mean(lang == "Français")) %>%
  arrange(desc(avg_fr))
## # A tibble: 41 × 2
##    biz_category       avg_fr
##    <chr>               <dbl>
##  1 Brasseries          0.436
##  2 Cinema              0.435
##  3 Delicatessen        0.397
##  4 Shopping Centers    0.353
##  5 Grocery             0.328
##  6 Portuguese          0.322
##  7 Sushi Bars          0.321
##  8 Coffee & Tea        0.315
##  9 Tapas/Small Plates  0.3  
## 10 Creperies           0.296
## # … with 31 more rows

And with the English reviews?

lang_by_user %>%
  inner_join(docs, by = "user_name") %>%
  mutate(lang = if_else(percent_fr > 0.5, "Français", "English")) %>%
  group_by(biz_category) %>%
  summarize(avg_fr = mean(lang == "Français")) %>%
  arrange(avg_fr)
## # A tibble: 41 × 2
##    biz_category avg_fr
##    <chr>         <dbl>
##  1 Soup         0.08  
##  2 Chinese      0.0964
##  3 Greek        0.0978
##  4 Thai         0.111 
##  5 Chicken Shop 0.118 
##  6 Korean       0.148 
##  7 British      0.151 
##  8 Burgers      0.156 
##  9 Mexican      0.159 
## 10 Italian      0.184 
## # … with 31 more rows

There’s a lot more that you could do here, but I thought this gave a good example and review of how we can use unsupervised learning techniques to understand a rich data set.
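
As one small example of a possible follow-up, here is a sketch that plots each reviewer’s share of French reviews, using only the lang_by_user table built above:

lang_by_user %>%
  ggplot(aes(percent_fr, reorder(user_name, percent_fr))) +   # one point per reviewer
    geom_point() +
    labs(x = "share of reviews classified as French", y = NULL)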