Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
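
The setup chunk itself is not reproduced here. Based on the functions used below, it loads roughly the following packages; this is a sketch, and the dsst_* helper functions come from the course’s own materials rather than from CRAN:

library(tidyverse)   # tibble, filter, ggplot2, and the pipe
library(httr)        # modify_url, GET, content
library(xml2)        # xml_find_all, xml_attr, xml_text
library(stringi)     # stri_trim, stri_sub, stri_length
library(ggrepel)     # geom_text_repel
# the dsst_* helpers (dsst_cache_get, dsst_pca, dsst_umap, dsst_metrics,
# dsst_lda_build, dsst_json_lda) are loaded from the course materials here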

Time API

Create a URL string to find the current time in the “America/Chicago” timezone.

# Question 01
url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/zone",
  query = list("timeZone" = "America/Chicago")
)
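
If you print url_str, you should see the query parameter appended to the base URL, with the slash in the timezone name percent-encoded, something along the lines of the comment below:

url_str
# e.g. "https://www.timeapi.io/api/Time/current/zone?timeZone=America%2FChicago"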

Run the query and print out the returned object. Verify that the time is one hour behind the current time in Richmond, VA.

# Question 02
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "application/json")
obj
## $year
## [1] 2023
## 
## $month
## [1] 3
## 
## $day
## [1] 25
## 
## $hour
## [1] 10
## 
## $minute
## [1] 40
## 
## $seconds
## [1] 26
## 
## $milliSeconds
## [1] 581
## 
## $dateTime
## [1] "2023-03-25T10:40:26.5812773"
## 
## $date
## [1] "03/25/2023"
## 
## $time
## [1] "10:40"
## 
## $timeZone
## [1] "America/Chicago"
## 
## $dayOfWeek
## [1] "Saturday"
## 
## $dstActive
## [1] TRUE
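
Here dsst_cache_get is a course helper that caches responses in the cache directory so repeated runs do not hit the API again. Outside the class setup, an equivalent (uncached) request with plain httr would look roughly like this:

res <- GET(url_str)                             # send the request directly
obj <- content(res, type = "application/json")  # parse the JSON body into a list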

The time API has another endpoint that gives the current time based on a latitude and longitude. The query starts with the URL “https://www.timeapi.io/api/Time/current/coordinate” and requires two query parameters, latitude and longitude. Create a query string to find the time in the city of Victoria, Seychelles. You’ll need to look up its latitude and longitude online.

# Question 03
url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/coordinate",
  query = list("latitude" = -4.6167, "longitude" = 55.45)
)

Run the query and print out the results. How far ahead is the time in the Seychelles from the time in Virginia?

# Question 04
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "application/json")
obj
## $year
## [1] 2023
## 
## $month
## [1] 3
## 
## $day
## [1] 25
## 
## $hour
## [1] 19
## 
## $minute
## [1] 43
## 
## $seconds
## [1] 1
## 
## $milliSeconds
## [1] 120
## 
## $dateTime
## [1] "2023-03-25T19:43:01.1209481"
## 
## $date
## [1] "03/25/2023"
## 
## $time
## [1] "19:43"
## 
## $timeZone
## [1] "Indian/Mahe"
## 
## $dayOfWeek
## [1] "Saturday"
## 
## $dstActive
## [1] FALSE
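
One way to answer the question (a sketch, assuming the lubridate package is available) is to convert the returned Mahé time into US Eastern time and compare:

library(lubridate)

mahe <- force_tz(ymd_hms("2023-03-25 19:43:01"), tzone = "Indian/Mahe")
with_tz(mahe, tzone = "America/New_York")
# 2023-03-25 11:43:01 EDT, so the Seychelles are eight hours ahead of Virginia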

CNN Lite

Let’s now return to the CNN Lite dataset. Your results will differ from the notes because you are creating the dataset on a different day. While you should be able to copy much of the code directly from the notes (and that’s okay!), try to copy only the code that is actually needed rather than all of the intermediate steps I showed just to illustrate the method.

To start, grab the HTML data from the front page of the CNN Lite website.

# Question 05
url_str <- modify_url("https://lite.cnn.com/")
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "text/html", encoding = "UTF-8")

Extract the links to each story and save them as an object called links.

# Question 06
temp <- xml_find_all(obj, "..//li/a")
links <- xml_attr(temp, "href")

Use the code below to create short titles for each of the documents. We will use these later as document ids.

title <- xml_text(xml_find_all(obj, "..//li/a"))
title <- stri_trim(title)
title <- stri_sub(title, 1L, 40L)
head(title)
## [1] "Gwyneth Paltrow could take stand in ski "
## [2] "House passes GOP education bill that aim"
## [3] "Alzheimer’s first signs may appear in yo"
## [4] "Opinion: Why an indictment might end up "
## [5] "Trump attorney set to testify before gra"
## [6] "What the banking crisis means for your j"

Now, iterate over the stories and create a vector text_all with one element for each story, containing the full text of that story.

# Question 07
text_all <- rep("", length(links))
for (j in seq_along(links))
{
  # download (or read from the cache) the HTML for this story
  url_str <- modify_url(paste0("https://lite.cnn.com/", links[j]))
  res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
  obj <- content(res, type = "text/html", encoding = "UTF-8")

  # extract the story paragraphs and trim whitespace
  text <- xml_find_all(obj, "..//p[@class='paragraph--lite']")
  text <- xml_text(text)
  text <- stri_trim(text)

  # collapse the paragraphs into a single string for this story
  text_all[j] <- paste0(text, collapse = " ")
}
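
As an optional sanity check (not part of the original solution), you can count how many stories came back with no text at all; any such stories are dropped before annotation in Question 09:

sum(stri_length(text_all) == 0)   # number of stories with empty text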

Build a docs table, but unlike in the notes use the short title of the document as the doc_id. Print out the table and verify that it matches the current version of the website.

# Question 08
docs <- tibble(
  doc_id = title,
  train_id = "train",
  text = text_all
)
docs
## # A tibble: 100 × 3
##    doc_id                                     train_id text                 
##    <chr>                                      <chr>    <chr>                
##  1 "Gwyneth Paltrow could take stand in ski " train    Actress Gwyneth Palt…
##  2 "House passes GOP education bill that aim" train    The House voted Frid…
##  3 "Alzheimer’s first signs may appear in yo" train    The eyes are more th…
##  4 "Opinion: Why an indictment might end up " train    The news of a potent…
##  5 "Trump attorney set to testify before gra" train    Evan Corcoran, Donal…
##  6 "What the banking crisis means for your j" train    One of the biggest u…
##  7 "Colorectal cancer is rising among younge" train    Nikki Lawson receive…
##  8 "This woman left her AirPods on a plane. " train    We’ve had people tra…
##  9 "‘Everything has changed’: Users of recal" train    Renee Martray of Sou…
## 10 "‘Hotel Rwanda’ hero to be released from " train    Paul Rusesabagina, w…
## # … with 90 more rows

Run the annotation algorithm over the documents. Print out the anno table and verify that it has the expected structure, matching what we have been using all semester.

# Question 09
library(cleanNLP)
cnlp_init_udpipe("english")

# drop any stories with no extracted text, then run the annotation pipeline
docs <- filter(docs, stringi::stri_length(text) > 0)
anno <- cnlp_annotate(docs)$token
## Processed document 10 of 100
## Processed document 20 of 100
## Processed document 30 of 100
## Processed document 40 of 100
## Processed document 50 of 100
## Processed document 60 of 100
## Processed document 70 of 100
## Processed document 80 of 100
## Processed document 90 of 100
## Processed document 100 of 100
anno
## # A tibble: 90,579 × 11
##    doc_id    sid tid   token token…¹ lemma upos  xpos  feats tid_s…² relat…³
##  * <chr>   <int> <chr> <chr> <chr>   <chr> <chr> <chr> <chr> <chr>   <chr>  
##  1 "Gwyne…     1 1     Actr… "Actre… actr… PROPN NNP   Numb… 4       nsubj  
##  2 "Gwyne…     1 2     Gwyn… "Gwyne… Gwyn… PROPN NNP   Numb… 1       flat   
##  3 "Gwyne…     1 3     Palt… "Paltr… Palt… PROPN NNP   Numb… 1       flat   
##  4 "Gwyne…     1 4     took  "took " take  VERB  VBD   Mood… 0       root   
##  5 "Gwyne…     1 5     the   "the "  the   DET   DT    Defi… 6       det    
##  6 "Gwyne…     1 6     stand "stand… stand NOUN  NN    Numb… 4       obj    
##  7 "Gwyne…     1 7     to    "to "   to    PART  TO    <NA>  8       mark   
##  8 "Gwyne…     1 8     test… "testi… test… VERB  VB    Verb… 6       acl    
##  9 "Gwyne…     1 9     on    "on "   on    ADP   IN    <NA>  10      case   
## 10 "Gwyne…     1 10    Frid… "Frida… Frid… PROPN NNP   Numb… 8       obl    
## # … with 90,569 more rows, and abbreviated variable names ¹​token_with_ws,
## #   ²​tid_source, ³​relation

Now, with the textual data, create a PCA plot of the 100 news articles using just nouns and verbs:

# Question 10
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_pca() %>%
  ggplot(aes(v1, v2)) +
    geom_point()
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

Let’s add some context to the above plot. The full titles would be too long to read as labels. Instead, create a data table called tterms that contains, for each document, the single noun or verb with the highest G-score associated with that document.

# Question 11
tterms <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(docs, label_var = "doc_id") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 1L) %>%
  select(doc_id = label, top_term = token)

Repeat the plot from question 10, but join to the tterms data and add a text repel layer showing the word most strongly associated with each document. You might want to set the label size to something smaller (2?) to make the plot easier to read. Can you begin to explain the shape of the PCA plot?

# Question 12
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_pca() %>%
  inner_join(tterms, by = "doc_id") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = top_term), size = 2)

Repeat the previous question but using UMAP instead of PCA. Note that the plot looks quite different (at least with the stories from the day I wrote this, the UMAP was more interesting).

# Question 13
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_umap() %>%
  inner_join(tterms, by = "doc_id") %>%
  ggplot(aes(v1, v2)) +
    geom_point() +
    geom_text_repel(aes(label = top_term), size = 2)

Finally, let’s build a topic model using 16 topics and only nouns, verbs, adjectives, and adverbs. Save the result as an object called model.

# Question 14
model <- anno %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
  dsst_lda_build(num_topics = 16)

Export the results as a JSON file and explore them using the topic model visualization tool here: https://statsmaths.github.io/topic-explo/build/

# Question 15
dsst_json_lda(model, docs)