As in the previous notes, I will use the Amazon product classification task. We will read in the docs and anno tables:
docs <- read_csv("../data/amazon_product_class.csv")
anno <- read_csv("../data/amazon_product_class_token.csv.gz")
These notes will briefly introduce a new measurement for associating a term with a particular category.
So far, we have focused on building predictive models based on features composed of word counts. We’ve had the twin goals of identifying words that are strongly associated (negatively or positively) with each category and understanding how different the categories are from one another. Can we easily tell them apart? Can we tell some of them apart? Can we tell some of them apart some of the time? And so forth.
If we are only interested in the first question, namely which terms are associated with each category, we can answer it more directly by measuring the strength of the relationship between the occurrence of a word and the presence of a particular category. Several such scores are commonly used with textual data. The one I find best is called the G-score; it provides a single number that tells how strongly a single term is associated with a given category.
We will derive the G-score more fully through the slides in class. Here we’ll just look at a function that computes the value:
dsst_metrics(anno, docs)
## # A tibble: 98,899 × 8
## train_id label token count expected count_word gscore chi2
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 train book book 3789 939. 3934 9726. 11397.
## 2 train book -PRON- 25401 14658. 61435 9557. 10769.
## 3 train book the 16893 9993. 41882 5728. 6432.
## 4 train book , 14740 8713. 36517 4991. 5609.
## 5 train book . 15492 9425. 39504 4714. 5263.
## 6 train book read 1724 447. 1873 3988. 4799.
## 7 train book to 8380 4470. 18736 3953. 4546.
## 8 train book be 13890 8636. 36194 3868. 4300.
## 9 train book of 8607 4703. 19713 3767. 4310.
## 10 train book and 9811 5666. 23750 3576. 4043.
## # … with 98,889 more rows
By default, the table is ordered by category and then in descending order by the gscore, so the most strongly associated terms come first within each label. You’ll probably want to select only the top scores for each label, which can be done as follows:
dsst_metrics(anno, docs) %>%
group_by(label) %>%
slice_head(n = 4)
## # A tibble: 12 × 8
## # Groups: label [3]
## train_id label token count expected count_word gscore chi2
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 train book book 3789 939. 3934 9726. 11397.
## 2 train book -PRON- 25401 14658. 61435 9557. 10769.
## 3 train book the 16893 9993. 41882 5728. 6432.
## 4 train book , 14740 8713. 36517 4991. 5609.
## 5 train film the 18872 10252. 41882 8657. 9863.
## 6 train film film 2404 597. 2439 6431. 7253.
## 7 train film , 15384 8939. 36517 5603. 6302.
## 8 train film movie 2235 570. 2327 5578. 6457.
## 9 train food -PRON- 14729 7130. 61435 7675. 9542.
## 10 train food . 8860 4585. 39504 3789. 4629.
## 11 train food taste 949 116. 1000 3702. 6768.
## 12 train food flavor 781 93.0 801 3185. 5764.
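Because slice_head() simply keeps the first rows of each group, it depends on the rows already being sorted. A variant that ranks by gscore explicitly does not. The following is a minimal sketch using dplyr's slice_max() on a small toy table; the table `metrics` and its values are illustrative stand-ins for the output of dsst_metrics(), not real results:

```r
library(dplyr)

# Toy stand-in for the output of dsst_metrics(); the rows are deliberately
# unsorted and the values are for illustration only.
metrics <- tibble(
  label  = c("book", "book", "book", "film", "film", "film"),
  token  = c("the", "book", "read", "movie", "film", "watch"),
  gscore = c(5728, 9726, 3988, 5578, 6431, 1382)
)

# Keep the two rows with the largest gscore within each label,
# regardless of the order in which the rows arrive.
top_terms <- metrics %>%
  group_by(label) %>%
  slice_max(gscore, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_terms)
```

The same pattern should work on the real table by swapping `metrics` for the result of dsst_metrics(anno, docs), assuming its score column is named gscore as in the printed output.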
And as with previous notes, we can modify and filter the initial dataset before running the metrics:
anno %>%
mutate(lemma = if_else(upos == "PRON", tolower(token), lemma)) %>%
mutate(lemma = if_else(lemma == "i", "I", lemma)) %>%
filter(upos %in% c("ADJ", "ADV", "NOUN", "VERB", "PRON")) %>%
dsst_metrics(docs) %>%
group_by(label) %>%
slice_head(n = 4)
## # A tibble: 12 × 8
## # Groups: label [3]
## train_id label token count expected count_word gscore chi2
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 train book book 3786 950. 3931 9656. 11229.
## 2 train book read 1723 452. 1872 3949. 4719.
## 3 train book I 5874 3607. 14926 1724. 1920.
## 4 train book author 582 145. 602 1490. 1729.
## 5 train film film 2398 567. 2433 6666. 7739.
## 6 train film movie 2234 542. 2326 5806. 6911.
## 7 train film it 4662 2841. 12195 1383. 1548.
## 8 train film watch 649 173. 742 1382. 1712.
## 9 train food I 5376 1857. 14926 5603. 7779.
## 10 train food taste 949 124. 1000 3575. 6250.
## 11 train food flavor 776 99.0 796 3059. 5290.
## 12 train food it 3440 1517. 12195 2204. 2831.
The last column is another measurement called the chi-squared statistic. If you’ve had a previous statistics course, this corresponds to the classic chi-squared test for a 2-by-2 table. We won’t use it much here, but I included it in case you want another measurement to compare with the G-score.
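To make the two scores concrete, here is a rough sketch of how each can be computed by hand for a single term and a single category. It assumes the G-score is the standard log-likelihood ratio (G-test) statistic on a 2-by-2 table of counts; these notes do not spell out the exact formula used inside dsst_metrics(), so treat this as an illustration rather than that function's implementation. The counts in `observed` are made up:

```r
# Hypothetical 2-by-2 table of token counts (made-up numbers).
# Rows: the term of interest vs. all other terms.
# Columns: the category of interest vs. all other categories.
observed <- matrix(
  c(30, 10,
    70, 190),
  nrow = 2, byrow = TRUE,
  dimnames = list(term = c("word", "other"), category = c("in", "out"))
)

# Expected counts if term and category were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# G statistic: 2 * sum(O * log(O / E)); cells with O = 0 contribute zero.
g_terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
gscore <- 2 * sum(g_terms)

# Pearson chi-squared statistic: sum((O - E)^2 / E).
chi2 <- sum((observed - expected)^2 / expected)

print(c(gscore = gscore, chi2 = chi2))
```

Both statistics grow as the observed counts move away from the expected counts, which is why they tend to rank terms similarly in the tables above.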