Timezone API

Let’s start with a relatively simple API that tells the current time in different time zones. This is probably not a good candidate for caching, but let’s use it anyway as an example. To start, we specify the URL using the protocol, authority, path, and query parameters.

url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/zone",
  query = list("timeZone" = "Europe/Amsterdam")
)
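
To see how the pieces of the URL fit together, here is a small sketch that takes the same URL apart again with parse_url from the httr package. It assumes httr is installed; the variable name parts is only for illustration.

```r
library(httr)

# Build the URL exactly as above
url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/zone",
  query = list("timeZone" = "Europe/Amsterdam")
)

# Split the URL back into its components
parts <- parse_url(url_str)
parts$scheme          # the protocol
parts$hostname        # the authority
parts$path            # the path
parts$query$timeZone  # the query parameter we set
```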

Next, we call the HTTP GET method using our cache wrapper function. The result can be parsed as JSON using the content function and an appropriate type.

res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "application/json")
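
The code for dsst_cache_get is not shown in these notes. As a rough sketch of the general idea only, a caching wrapper could look something like the following. The function name cache_get_sketch, the fetch argument, and the file-naming scheme are all assumptions made for illustration; they are not the actual implementation.

```r
# Hypothetical sketch of a caching wrapper; NOT the real dsst_cache_get.
# `fetch` is any function that takes a URL and returns a response.
cache_get_sketch <- function(url, cache_dir = "cache", force = FALSE,
                             fetch = httr::GET) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  # Turn the URL into a safe file name by percent-encoding it
  key <- utils::URLencode(url, reserved = TRUE)
  path <- file.path(cache_dir, paste0(key, ".rds"))

  # Reuse the stored response unless the caller forces a refresh
  if (!force && file.exists(path)) {
    return(readRDS(path))
  }

  res <- fetch(url)
  saveRDS(res, path)
  res
}
```

With a wrapper of this kind, the first call for a URL makes the request and saves the response, and later calls with force = FALSE read the saved copy instead of contacting the server again.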

The object obj is a list object, in this case a set of name/value pairs.

obj
## $year
## [1] 2023
## 
## $month
## [1] 3
## 
## $day
## [1] 25
## 
## $hour
## [1] 14
## 
## $minute
## [1] 53
## 
## $seconds
## [1] 43
## 
## $milliSeconds
## [1] 327
## 
## $dateTime
## [1] "2023-03-25T14:53:43.3276041"
## 
## $date
## [1] "03/25/2023"
## 
## $time
## [1] "14:53"
## 
## $timeZone
## [1] "Europe/Amsterdam"
## 
## $dayOfWeek
## [1] "Saturday"
## 
## $dstActive
## [1] FALSE

We can access any specific element using the dollar sign operator, just as we do with the objects returned by functions such as dsst_enet_build.

obj$minute
## [1] 53
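
As a small illustration, the sketch below rebuilds part of the list by hand (using values copied from the output above) and shows two ways of pulling elements out of it. The double-bracket form is useful when the name of the element is stored in a variable; the names field and clock are only for illustration.

```r
# A hand-built list with a few of the fields shown above
obj <- list(hour = 14L, minute = 53L, timeZone = "Europe/Amsterdam")

# Access by name with the dollar sign operator
obj$minute

# Access by a name stored in a variable
field <- "hour"
obj[[field]]

# Combine two elements into a clock time such as "14:53"
clock <- sprintf("%02d:%02d", obj$hour, obj$minute)
clock
```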

CNN Lite

While the term API is usually used to describe access points designed specifically for programs to access data, we can use the same ideas to scrape data from a website. Your browser can be thought of as a program that uses an API to access data in the form of HTML, CSS, and JavaScript. Take a moment to look at the CNN Lite website. We’ll try to grab data from this page from within R.

The “API” here is simple; it has no query parameters. The data that is returned is in a markup language called HTML, so we change the type of data that is returned by the content function:

url_str <- modify_url("https://lite.cnn.com/")
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "text/html", encoding = "UTF-8")

The returned object belongs to a special R class that handles XML/HTML data.

obj
## {html_document}
## <html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/cnnlite-v1@published">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=U ...
## [2] <body class="cnn">\n      <header class="header--lite"><a href="/" cl ...

We’ll cover more details over the next few classes about how to use XML/HTML objects. Here we’ll just dive into some examples. To start, I’ll use the xml_find_all function to find links (the tag “a”) that are inside list items (the tag “li”). This returns one link for each of the 100 stories on the front page of the CNN Lite website.

xml_find_all(obj, "..//li/a")
## {xml_nodeset (100)}
##  [1] <a href="/2023/03/24/entertainment/gwyneth-paltrow-ski-collision-tri ...
##  [2] <a href="/2023/03/24/politics/house-vote-parents-bill-of-rights-act" ...
##  [3] <a href="/2023/03/24/health/eye-early-alzheimers-diagnosis-wellness" ...
##  [4] <a href="/2023/03/24/opinions/indictment-might-end-up-helping-trump- ...
##  [5] <a href="/2023/03/24/politics/evan-corcoran-testimony-documents-prob ...
##  [6] <a href="/2023/03/24/economy/federal-reserve-unemployment-projection ...
##  [7] <a href="/2023/03/24/health/colorectal-cancer-young-age-mystery">\n  ...
##  [8] <a href="/travel/article/airpods-tracked-down">\n        This woman  ...
##  [9] <a href="/2023/03/24/health/eye-infection-patients">\n        ‘Every ...
## [10] <a href="/2023/03/24/africa/paul-rusesabagina-released-rwanda-intl"> ...
## [11] <a href="/2023/03/24/us/baltimore-county-truck-crash">\n        Truc ...
## [12] <a href="/2023/03/24/entertainment/yellowjackets-season-2-review">\n ...
## [13] <a href="/2023/03/24/tech/china-opposes-tiktok-sale-approval-needed- ...
## [14] <a href="/2023/03/24/us/denver-colorado-school-shooting-friday">\n   ...
## [15] <a href="/2023/03/24/cars/eu-combustion-engine-debate-climate-intl"> ...
## [16] <a href="/2023/03/24/tech/twitter-verified-checkmarks">\n        Pay ...
## [17] <a href="/2023/03/24/sport/sweet-16-fau-tennessee-upset-march-madnes ...
## [18] <a href="/2023/03/24/app-news-section/videos-of-the-week-mobile-marc ...
## [19] <a href="/2023/03/24/middleeast/israel-netanyahu-judicial-overhaul-i ...
## [20] <a href="/2023/03/24/sport/aurelien-sanchez-barkley-marathons-ultrar ...
## ...

We can extract the links to these pages using the function xml_attr and grabbing the “href” attribute.

temp <- xml_find_all(obj, "..//li/a")
links <- xml_attr(temp, "href")
head(links)
## [1] "/2023/03/24/entertainment/gwyneth-paltrow-ski-collision-trial-friday"
## [2] "/2023/03/24/politics/house-vote-parents-bill-of-rights-act"          
## [3] "/2023/03/24/health/eye-early-alzheimers-diagnosis-wellness"          
## [4] "/2023/03/24/opinions/indictment-might-end-up-helping-trump-zelizer"  
## [5] "/2023/03/24/politics/evan-corcoran-testimony-documents-probe"        
## [6] "/2023/03/24/economy/federal-reserve-unemployment-projections"
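
To see these two steps without making a request, here is a minimal sketch that runs them on a tiny hand-written page. The HTML and its link paths are invented for illustration, and the xml2 package is assumed to be installed. The last step uses url_absolute from xml2, one way to turn site-relative links into full URLs.

```r
library(xml2)

# A tiny invented page: two story links in list items, one link elsewhere
html <- '<html><body><ul>
  <li><a href="/2023/03/24/example/first-story">First story</a></li>
  <li><a href="/2023/03/24/example/second-story">Second story</a></li>
</ul><p><a href="/about">About</a></p></body></html>'
page <- read_html(html)

# Keep only the links that sit directly inside a list item
nodes <- xml_find_all(page, ".//li/a")

# Pull the href attribute out of each matching node
links <- xml_attr(nodes, "href")
links

# Resolve the relative links against the site root
full <- url_absolute(links, "https://lite.cnn.com/")
full
```

The link inside the paragraph is not returned because the XPath expression asks only for “a” tags whose parent is an “li” tag.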

Once we have the links, we can grab the actual content of a specific link. For example, here we grab the first page:

url_str <- modify_url(paste0("https://lite.cnn.com/", links[1]))
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "text/html", encoding = "UTF-8")

Once again, the page is an HTML document.

obj
## {html_document}
## <html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/entertainment-article-v1@published">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=U ...
## [2] <body class="entertainment">\n          <div>\n          <header clas ...

From this, we can get each paragraph in the document:

xml_find_all(obj, "..//p[@class='paragraph--lite']")
## {xml_nodeset (48)}
##  [1] <p class="paragraph--lite">\n  Actress<a href="https://www.cnn.com/e ...
##  [2] <p class="paragraph--lite">\n  The actress and businesswoman has bee ...
##  [3] <p class="paragraph--lite">\n  Sanderson has accused Paltrow of cras ...
##  [4] <p class="paragraph--lite">\n  Paltrow <a href="https://www.cnn.com/ ...
##  [5] <p class="paragraph--lite">\n  The two have been in a legal battle f ...
##  [6] <p class="paragraph--lite">\n  Paltrow took the stand just before 3p ...
##  [7] <p class="paragraph--lite">\n  The Goop founder said on the day of t ...
##  [8] <p class="paragraph--lite">\n  The trip, Paltrow said, was the first ...
##  [9] <p class="paragraph--lite">\n  The collision happened on the first d ...
## [10] <p class="paragraph--lite">\n  All of the children, who are around t ...
## [11] <p class="paragraph--lite">\n  When asked if she was a good tipper,  ...
## [12] <p class="paragraph--lite">\n  Paltrow then repeated her assertion t ...
## [13] <p class="paragraph--lite">\n   “He struck me in the back, yes, that ...
## [14] <p class="paragraph--lite">\n  At one point, VanOrman had hoped to u ...
## [15] <p class="paragraph--lite">\n  Paltrow testified that she was trying ...
## [16] <p class="paragraph--lite">\n  She testified Friday that two skis ca ...
## [17] <p class="paragraph--lite">\n  VanOrman walked around the courtroom  ...
## [18] <p class="paragraph--lite">\n  Paltrow said that at one point during ...
## [19] <p class="paragraph--lite">\n  They both came crashing down together ...
## [20] <p class="paragraph--lite">\n  Paltrow said she did not ask about th ...
## ...

And with a little bit of cleaning, we have the full text of the article in R:

text <- xml_find_all(obj, "..//p[@class='paragraph--lite']")
text <- xml_text(text)
text <- stri_trim(text)
head(text)
## [1] "Actress Gwyneth Paltrow took the stand to testify on Friday in a Utah trial over a 2016 snow skiing accident, painting a picture of her version of events relating to the incident over roughly two hours of testimony."                                                                                          
## [2] "The actress and businesswoman has been present in the courtroom since the trial began on Tuesday when lawyers representing Paltrow and Terry Sanderson, a 76-year-old retired optometrist, presented their opening statements to a seated jury."                                                                  
## [3] "Sanderson has accused Paltrow of crashing into him and causing him lasting injuries and brain damage while they were both skiing on a beginner’s run on a Utah mountain in February of 2016. Sanderson also claims Paltrow and her ski instructor skied away after the incident without getting him medical care."
## [4] "Paltrow filed a countersuit against Sanderson in 2019 claiming that he skied into her."                                                                                                                                                                                                                           
## [5] "The two have been in a legal battle for seven years."                                                                                                                                                                                                                                                             
## [6] "Paltrow took the stand just before 3pm local time and was questioned by Sanderson’s attorney, Kristin A. VanOrman."
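
The same extraction and cleaning steps can be tried on a small hand-written page. The HTML below is invented for illustration, and the sketch uses the base R function trimws as a stand-in for stri_trim so that it needs only the xml2 package.

```r
library(xml2)

# An invented article page: two body paragraphs and one unrelated paragraph
html <- '<html><body>
<p class="paragraph--lite">
  First paragraph with a <a href="#">link</a> inside.
</p>
<p class="byline">Not part of the article body.</p>
<p class="paragraph--lite">
  Second paragraph.
</p>
</body></html>'
page <- read_html(html)

# Select only paragraphs whose class attribute is exactly "paragraph--lite"
paras <- xml_find_all(page, ".//p[@class='paragraph--lite']")

# Drop the markup (including nested links) and trim surrounding whitespace
text <- trimws(xml_text(paras))
text
```

Note that xml_text keeps the words inside the nested link but discards the tag itself, which is why the first paragraph comes back as plain text.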

Iteration over CNN Lite

We can use a for loop to cycle over each of the links and store the text from all 100 stories. Let’s try to do that now!

text_all <- rep("", length(links))
for (j in seq_along(links)) {
  url_str <- modify_url(paste0("https://lite.cnn.com/", links[j]))
  res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
  obj <- content(res, type = "text/html", encoding = "UTF-8")
  
  text <- xml_find_all(obj, "..//p[@class='paragraph--lite']")
  text <- xml_text(text)
  text <- stri_trim(text)
  
  text_all[j] <- paste0(text, collapse = " ")
}
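
When looping over many pages, a single failed request stops the whole loop. One possible safeguard, sketched below, is to wrap the body of the loop in tryCatch and store an empty string when something goes wrong. The function fetch_text and the link names are hypothetical stand-ins for the download and parsing steps above, so that the sketch runs without a network connection.

```r
# Hypothetical stand-in for the download-and-parse steps in the loop body;
# it fails on one link so that the error handling can be seen.
fetch_text <- function(link) {
  if (link == "/broken-story") stop("could not download ", link)
  paste("Text of", link)
}

links <- c("/first-story", "/broken-story", "/third-story")
text_all <- rep("", length(links))
for (j in seq_along(links)) {
  text_all[j] <- tryCatch(
    fetch_text(links[j]),
    error = function(e) {
      # Report the problem and leave this document empty
      message("Skipping ", links[j], ": ", conditionMessage(e))
      ""
    }
  )
}
text_all
```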

Now, we can create a dataset that looks a lot like the docs tables that we have been working with all semester:

docs <- tibble(
  doc_id = sprintf("doc%04d", seq_along(links)),
  train_id = "train",
  text = text_all
)
docs
## # A tibble: 100 × 3
##    doc_id  train_id text                                                    
##    <chr>   <chr>    <chr>                                                   
##  1 doc0001 train    Actress Gwyneth Paltrow took the stand to testify on Fr…
##  2 doc0002 train    The House voted Friday to pass a controversial bill tha…
##  3 doc0003 train    The eyes are more than a window to the soul — they’re a…
##  4 doc0004 train    The news of a potential indictment would likely derail …
##  5 doc0005 train    Evan Corcoran, Donald Trump’s primary defense attorney,…
##  6 doc0006 train    One of the biggest unknowns since the Federal Reserve s…
##  7 doc0007 train    Nikki Lawson received the shock of her life at age 35. …
##  8 doc0008 train    We’ve had people tracking their bags when airlines can’…
##  9 doc0009 train    Renee Martray of South Carolina has severe and permanen…
## 10 doc0010 train    Paul Rusesabagina, who inspired the Hollywood film “Hot…
## # … with 90 more rows

Annotation

Now, we need to create the anno table from the docs table. In the past I have given this to you, but this time you will have to make it yourself. The algorithm I used requires setting up Python, which is more trouble than it is worth for one class project. Let’s instead use a C-based algorithm that requires no additional setup.

Here is the code to run the annotations over the documents. We will also remove any empty documents in the process, which can cause bugs later on.

library(cleanNLP)
cnlp_init_udpipe("english")

docs <- filter(docs, stringi::stri_length(text) > 0)
anno <- cnlp_annotate(docs)$token
## Processed document 10 of 100
## Processed document 20 of 100
## Processed document 30 of 100
## Processed document 40 of 100
## Processed document 50 of 100
## Processed document 60 of 100
## Processed document 70 of 100
## Processed document 80 of 100
## Processed document 90 of 100
## Processed document 100 of 100

The annotation process takes some time, but shouldn’t be too bad with only 100 short documents.
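
To make the filtering step concrete, here is a toy version in base R with invented text. It uses data.frame and nzchar as stand-ins for tibble, filter, and stri_length so that it needs no extra packages.

```r
# A toy docs table in which two of the four documents are empty
docs <- data.frame(
  doc_id = sprintf("doc%04d", 1:4),
  train_id = "train",
  text = c("First story.", "", "Third story.", ""),
  stringsAsFactors = FALSE
)

# Keep only the rows with non-empty text
docs <- docs[nzchar(docs$text), ]
docs$doc_id
```

Because the identifiers are assigned before filtering, the remaining documents keep their original doc_id values rather than being renumbered.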

And then?

Now, we can use all of the functions we have used in class on the data. There is no straightforward predictive task, but we can use any of the unsupervised algorithms to study the data. For example, here are the words with the highest G-scores associated with each news article:

anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(docs, label_var = "doc_id") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 6L) %>%
  summarize(terms = paste0(token, collapse = "; ")) %>%
  getElement("terms")
##   [1] "collision; ski; testify; accident; damage; stand"              
##   [2] "school; parent; classroom; bill; vote; child"                  
##   [3] "disease; study; cell; brain; eye; decline"                     
##   [4] "indictment; Trump; candidate; supporter; voter; persona"       
##   [5] "jury; document; search; prosecutor; probe; subpoena"           
##   [6] "banking; stability; rate; economist; crisis; sector"           
##   [7] "cancer; patient; adult; factor; weight; rise"                  
##   [8] "plane; employee; detective; husband; airport; track"           
##   [9] "eye; infection; outbreak; vision; tear; drop"                  
##  [10] "official; release; government; aide; sentence; family"         
##  [11] "fire; crash; department; explosion; fuel; diesel"              
##  [12] "season; character; mystery; other; alive; asham"               
##  [13] "sale; algorithm; technology; recommendation; accordance; force"
##  [14] "school; police; officer; teacher; student; shoot"              
##  [15] "climate; fuel; car; fleet; exception; allow"                   
##  [16] "tweet; account; program; user; revenue; company"               
##  [17] "game; victory; edge; half; play; man"                          
##  [18] "pilot; history; briefe; caring; celebrate; play"               
##  [19] "conflict; interest; ally; violate; speech; nation"             
##  [20] "race; course; finish; sleep; mile; loops"                      
##  [21] "attack; coalition; troops; target; carry; drone"               
##  [22] "strike; traveler; protest; visit; destination; disruption"     
##  [23] "murder; identify; killer; investigator; find; know"            
##  [24] "rate; inflation; home; buyer; slow; yield"                     
##  [25] "protester; pension; government; reform; police; retirement"    
##  [26] "child; diagnose; identification; prevalence; trend; detection" 
##  [27] "visit; postpone; pension; reform; confirm; travel"             
##  [28] "opposition; democracy; party; conviction; disqualify; low"     
##  [29] "bank; banks; investor; bond; market; index"                    
##  [30] "app; user; filter; platform; video; difference"                
##  [31] "city; force; deport; troops; town; region"                     
##  [32] "data; security; app; information; collect; user"               
##  [33] "storm; expect; flood; rain; watch; wind"                       
##  [34] "jury; attorney; source; witness; prosecutor; hear"             
##  [35] "gear; film; production; show; series; length"                  
##  [36] "floor; pavement; shoe; visitor; walk; ceremony"                
##  [37] "authority; check; operate; company; target; background"        
##  [38] "leak; water; compound; contain; monitor; milligram"            
##  [39] "wave; art; copy; today; produce; sell"                         
##  [40] "record; win; penalty; score; goal; glory"                      
##  [41] "team; pride; player; night; community; celebration"            
##  [42] "testify; ski; Paltrow; Plaintiff; injury; videotaped"          
##  [43] "flash; flooding; flood; rain; soil; water"                     
##  [44] "woman; transgender; athlete; tran; sport; advantage"           
##  [45] "abuse; detention; survivor; rights; prison; camp"              
##  [46] "manufacturing; cabinet; highlight; infrastructure; stop; week" 
##  [47] "eat; disorder; community; faith; health; illness"              
##  [48] "test; missile; cruise; drone; weapon; analyst"                 
##  [49] "water; claim; statement; presence; dispute; operation"         
##  [50] "minister; overhaul; declare; law; settlement; sit"             
##  [51] "ubs; bank; credit; deal; Suisse; franc"                        
##  [52] "arm; war; relationship; representative; weapon; defense"       
##  [53] "zoo; officer; fire; arrive; animal; escape"                    
##  [54] "vehicle; driver; zone; police; crash; state"                   
##  [55] "algorithm; technology; sale; regulator; data; recommendation"  
##  [56] "lawmaker; million; firm; own; hearing; value"                  
##  [57] "search; jury; property; document; subpoena; Trump"             
##  [58] "season; sport; team; champion; win; ownership"                 
##  [59] "collapse; arrest; fraud; capital; charge; believe"             
##  [60] "terrorism; terrorist; message; fighter; prosecutor; bureau"    
##  [61] "commission; officer; spokesman; police; recommend; violation"  
##  [62] "worker; district; union; student; school; wage"                
##  [63] "lawmaker; questioning; question; pose; server; answer"         
##  [64] "interview; senator; campaign; letter; aide; solicit"           
##  [65] "motion; charge; bond; miss; document; case"                    
##  [66] "surname; court; election; defamation; leader; speech"          
##  [67] "flight; airline; pilot; crew; aircraft; assistance"            
##  [68] "border; crossing; agreement; entry; migrants; port"            
##  [69] "migration; asylum; country; pose; continue; gang"              
##  [70] "care; gender; affirme; therapy; suicide; treatment"            
##  [71] "care; affirme; rule; gender; bans; minor"                      
##  [72] "school; student; shooting; gun; board; police"                 
##  [73] "request; review; letter; information; committee; investigation"
##  [74] "table; mom; think; baby; joke; feel"                           
##  [75] "abortion; debate; decide; endorsement; state; opponent"        
##  [76] "veto; overturn; investment; rule; resolution; retirement"      
##  [77] "hour; union; contract; pay; raise; cast"                       
##  [78] "season; tv; streaming; show; finish; story"                    
##  [79] "traffic; air; staff; airport; memo; airline"                   
##  [80] "police; shoot; teenager; condition; hospital; victim"          
##  [81] "plaintiff; court; athlete; tran; policy; suit"                 
##  [82] "episode; character; backlash; casting; define; helmet"         
##  [83] "bill; media; teens; legislation; parent; platform"             
##  [84] "expansion; state; health; expand; hospital; benefit"           
##  [85] "officer; van; drag; fail; intervene; allege"                   
##  [86] "data; collect; concern; hearing; security; video"              
##  [87] "executive; election; lawyer; rig; theories; judge"             
##  [88] "dispatcher; child; rescue; boy; crawl; responde"               
##  [89] "incidence; case; people; report; pandemic; end"                
##  [90] "player; game; appearance; partnership; speed; write"           
##  [91] "son; parent; opinion; gun; text; contemplate"                  
##  [92] "wait; budget; category; customer; time; improve"               
##  [93] "retirement; age; worker; protest; benefit; raise"              
##  [94] "mystery; series; viewer; show; concrete; Yellowjackets"        
##  [95] "drug; milligram; prescription; death; keep; contain"           
##  [96] "committee; subpoena; chilling; item; share; produce"           
##  [97] "firefighter; train; chemical; car; information; accident"      
##  [98] "conspiracy; lie; news; network; lawsuit; spread"               
##  [99] "memo; threat; report; school; enforcement; claim"              
## [100] "prosecut; serve; offense; end; corruption; charge"
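
The internals of dsst_metrics are not shown in these notes. As a reminder of the idea behind a G-score, here is a sketch of one common definition, the log-likelihood ratio statistic on a 2×2 table of counts. The function name g_score, its argument names, and the counts are assumptions made for illustration, and the function may differ in detail from what dsst_metrics computes.

```r
# Sketch of a log-likelihood ratio (G) statistic for one term and one document.
#   k11: count of the term in the document
#   k12: count of the term in all other documents
#   k21: count of all other terms in the document
#   k22: count of all other terms in all other documents
g_score <- function(k11, k12, k21, k22) {
  obs <- c(k11, k12, k21, k22)
  n <- sum(obs)
  row1 <- k11 + k12
  row2 <- k21 + k22
  col1 <- k11 + k21
  col2 <- k12 + k22

  # Expected counts if the term were spread evenly across documents
  expct <- c(row1 * col1, row1 * col2, row2 * col1, row2 * col2) / n

  # Cells with an observed count of zero contribute nothing to the sum
  terms <- ifelse(obs > 0, obs * log(obs / expct), 0)
  2 * sum(terms)
}

# Invented counts: a term used 20 times in a 500-token article
# and 5 times in the remaining 49,500 tokens of the collection
g_score(20, 5, 480, 49495)

# The same term spread evenly across two equal halves scores zero
g_score(10, 10, 10, 10)
```

A term scores highly when its observed count in a document is far from the count expected under an even spread, which matches the filter on count > expected in the pipeline above.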

Can you get a sense of the article topics by just looking at the top words?