Load the Data

As in the previous notes, I will use the Amazon product classification task. We will read in the docs and anno tables:

library(tidyverse)  # provides read_csv() and the dplyr verbs used below

docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")

These notes will briefly introduce a new measurement for associating a term with a particular category.

G-Score

So far, we have focused on building predictive models based on features composed of word counts. We’ve had the twin goals of identifying words that are strongly associated (negatively or positively) with each category and understanding how different the categories are from one another. Can we easily tell them apart? Can we tell some of them apart? Can we tell some of them apart some of the time? And so forth.

If we are only interested in the first question, namely which terms are associated with each category, we can answer it more directly by measuring the strength of the relationship between the occurrence of a word and the presence of a particular category. There are a few different scores of this kind that are commonly used with textual data. The one I find the best is called the G-score; it provides a single number that tells how strongly a single term is associated with a given category.

We will derive the G-score more fully through the slides in class. Here we’ll just look at a function that computes the value:

dsst_metrics(anno, docs)
## # A tibble: 98,899 × 8
##    train_id label token  count expected count_word gscore   chi2
##    <chr>    <chr> <chr>  <int>    <dbl>      <int>  <dbl>  <dbl>
##  1 train    book  book    3789     939.       3934  9726. 11397.
##  2 train    book  -PRON- 25401   14658.      61435  9557. 10769.
##  3 train    book  the    16893    9993.      41882  5728.  6432.
##  4 train    book  ,      14740    8713.      36517  4991.  5609.
##  5 train    book  .      15492    9425.      39504  4714.  5263.
##  6 train    book  read    1724     447.       1873  3988.  4799.
##  7 train    book  to      8380    4470.      18736  3953.  4546.
##  8 train    book  be     13890    8636.      36194  3868.  4300.
##  9 train    book  of      8607    4703.      19713  3767.  4310.
## 10 train    book  and     9811    5666.      23750  3576.  4043.
## # … with 98,889 more rows
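
To give a rough sense of what the gscore column measures, here is a minimal sketch of the usual log-likelihood-ratio form of the statistic, G = 2 Σ O log(O / E), where O is an observed count and E is the count expected if the term and the category were independent. The function name `g_score_2x2` and its arguments are hypothetical, and the exact bookkeeping inside `dsst_metrics()` (for example, how totals are formed) may differ, so do not expect it to reproduce the table above digit for digit.

```r
# Sketch: G statistic for a 2-by-2 table of counts (base R only).
# Hypothetical helper; not the source code of dsst_metrics().
#   n11: term count inside the category   n12: term count in other categories
#   n21: other tokens inside the category n22: other tokens in other categories
g_score_2x2 <- function(n11, n12, n21, n22) {
  observed <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
  # Expected counts if the term and the category were independent
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with a zero count contribute nothing to the sum
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Made-up counts, for illustration only
g_score_2x2(30, 10, 70, 90)  # term over-represented in the category
g_score_2x2(10, 20, 30, 60)  # counts match independence, so G is zero
```

The score is zero when the observed counts match the expected ones and grows as they diverge, which is why tokens whose count is far from expected have the largest values.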

By default the table is ordered by category and then in descending order by the gscore. You’ll probably want to select only the top scores for each label, which can be done as follows:

dsst_metrics(anno, docs) %>%
  group_by(label) %>%
  slice_head(n = 4)
## # A tibble: 12 × 8
## # Groups:   label [3]
##    train_id label token  count expected count_word gscore   chi2
##    <chr>    <chr> <chr>  <int>    <dbl>      <int>  <dbl>  <dbl>
##  1 train    book  book    3789    939.        3934  9726. 11397.
##  2 train    book  -PRON- 25401  14658.       61435  9557. 10769.
##  3 train    book  the    16893   9993.       41882  5728.  6432.
##  4 train    book  ,      14740   8713.       36517  4991.  5609.
##  5 train    film  the    18872  10252.       41882  8657.  9863.
##  6 train    film  film    2404    597.        2439  6431.  7253.
##  7 train    film  ,      15384   8939.       36517  5603.  6302.
##  8 train    film  movie   2235    570.        2327  5578.  6457.
##  9 train    food  -PRON- 14729   7130.       61435  7675.  9542.
## 10 train    food  .       8860   4585.       39504  3789.  4629.
## 11 train    food  taste    949    116.        1000  3702.  6768.
## 12 train    food  flavor   781     93.0        801  3185.  5764.
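
The `slice_head()` approach relies on the rows already being sorted by score within each label. If you have reordered or filtered the table in between, dplyr's `slice_max()` selects the top rows by an explicit column instead. The sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical, not output from `dsst_metrics()`) so that it runs on its own.

```r
library(dplyr)

# Made-up scores, for illustration only; not output from dsst_metrics()
toy <- data.frame(
  label  = c("a", "a", "a", "b", "b", "b"),
  token  = c("x", "y", "z", "x", "y", "z"),
  gscore = c(5, 20, 10, 7, 1, 30),
  stringsAsFactors = FALSE
)

# Keep the two highest-scoring tokens per label, regardless of row order
top_terms <- toy %>%
  group_by(label) %>%
  slice_max(gscore, n = 2, with_ties = FALSE) %>%
  ungroup()

top_terms
```

Setting `with_ties = FALSE` guarantees exactly `n` rows per group even when two tokens share a score.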

And as with previous notes, we can modify and filter the initial dataset before running the metrics:

anno %>%
  mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
  mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
  filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
  dsst_metrics(docs) %>%
  group_by(label) %>%
  slice_head(n = 4)
## # A tibble: 12 × 8
## # Groups:   label [3]
##    train_id label token  count expected count_word gscore   chi2
##    <chr>    <chr> <chr>  <int>    <dbl>      <int>  <dbl>  <dbl>
##  1 train    book  book    3786    950.        3931  9656. 11229.
##  2 train    book  read    1723    452.        1872  3949.  4719.
##  3 train    book  I       5874   3607.       14926  1724.  1920.
##  4 train    book  author   582    145.         602  1490.  1729.
##  5 train    film  film    2398    567.        2433  6666.  7739.
##  6 train    film  movie   2234    542.        2326  5806.  6911.
##  7 train    film  it      4662   2841.       12195  1383.  1548.
##  8 train    film  watch    649    173.         742  1382.  1712.
##  9 train    food  I       5376   1857.       14926  5603.  7779.
## 10 train    food  taste    949    124.        1000  3575.  6250.
## 11 train    food  flavor   776     99.0        796  3059.  5290.
## 12 train    food  it      3440   1517.       12195  2204.  2831.

The last column is another measurement called the chi-squared statistic. If you’ve had a previous statistics course, this corresponds to the classic chi-squared test for a 2-by-2 table. We won’t use it much here, but I included it just in case you want another measurement to compare with the G-score.
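
For readers who want to see the connection to the classic test, here is a minimal sketch of the Pearson chi-squared statistic for a 2-by-2 table, checked against base R's `chisq.test()`. The helper `chi2_2x2` and the counts are hypothetical, and I am assuming the uncorrected form of the statistic (no continuity correction); the chi2 column from `dsst_metrics()` may be computed with different bookkeeping.

```r
# Sketch: Pearson chi-squared statistic for a 2-by-2 table of counts.
# Hypothetical helper; not the source code of dsst_metrics().
chi2_2x2 <- function(n11, n12, n21, n22) {
  observed <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
  # Expected counts if rows and columns were independent
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

# Made-up counts, for illustration only
made_up <- matrix(c(30, 10, 70, 90), nrow = 2, byrow = TRUE)

by_hand  <- chi2_2x2(30, 10, 70, 90)
built_in <- unname(chisq.test(made_up, correct = FALSE)$statistic)

by_hand   # 12.5
built_in  # same value, without the continuity correction
```

Both statistics compare observed counts with the counts expected under independence; they differ only in how each cell's discrepancy is weighted, which is why the gscore and chi2 columns tend to rank terms similarly.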