Let’s start with a relatively simple API that tells the current time in different time zones. This is probably not a good candidate for caching, but let’s use it anyway as an example. To start, we specify the URL using the protocol, authority, path, and query parameters.
url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/zone",
  query = list("timeZone" = "Europe/Amsterdam")
)
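To see what this call produces, here is a small sketch that builds the same URL and then takes it apart again. It assumes that modify_url and parse_url come from the httr package, which these notes do not state explicitly.

```r
library(httr)

# Build the URL from a base address plus a named list of query parameters.
url_str <- modify_url(
  "https://www.timeapi.io/api/Time/current/zone",
  query = list(timeZone = "Europe/Amsterdam")
)

# Split the URL back into its components to check each part.
parts <- parse_url(url_str)
parts$scheme    # the protocol
parts$hostname  # the authority
parts$path      # the path
parts$query     # the query parameters, as a named list
```

Printing url_str shows how the query parameters are attached to the end of the address after a question mark.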
Next, we call the HTTP GET method using our cache wrapper function. The result can be parsed as JSON using the content function and an appropriate type.
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "application/json")
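The notes do not show how dsst_cache_get works internally. As a rough idea of what a caching wrapper does, here is a minimal sketch under my own assumptions: the function name cache_get_sketch, the fetch argument, and the file naming scheme are all hypothetical and are not taken from the course package.

```r
# Hypothetical sketch of a caching wrapper; not the course's dsst_cache_get.
cache_get_sketch <- function(url, cache_dir = "cache", force = FALSE,
                             fetch = httr::GET) {
  # Make sure the cache directory exists.
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  # Turn the URL into a safe file name by percent-encoding every
  # character other than letters, digits, and a few punctuation marks.
  file_name <- paste0(
    utils::URLencode(url, reserved = TRUE, repeated = TRUE), ".rds"
  )
  cache_file <- file.path(cache_dir, file_name)

  # Reuse the stored response unless a fresh request is forced.
  if (file.exists(cache_file) && !force) {
    return(readRDS(cache_file))
  }

  # Otherwise make the request and store the result for next time.
  res <- fetch(url)
  saveRDS(res, cache_file)
  res
}
```

With a wrapper like this, calling the function twice with the same URL only hits the server once, and setting force = TRUE asks for a fresh copy. Very long URLs could exceed file name limits, which a real implementation would need to handle.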
The object obj is a list, in this case a set of name/value pairs.
obj
## $year
## [1] 2023
##
## $month
## [1] 3
##
## $day
## [1] 25
##
## $hour
## [1] 14
##
## $minute
## [1] 53
##
## $seconds
## [1] 43
##
## $milliSeconds
## [1] 327
##
## $dateTime
## [1] "2023-03-25T14:53:43.3276041"
##
## $date
## [1] "03/25/2023"
##
## $time
## [1] "14:53"
##
## $timeZone
## [1] "Europe/Amsterdam"
##
## $dayOfWeek
## [1] "Saturday"
##
## $dstActive
## [1] FALSE
We can access any specific element using the dollar sign operator, just as we do with the objects returned by functions such as dsst_enet_build.
obj$minute
## [1] 53
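If you want to experiment with this kind of object without calling the API, the sketch below parses a short hand-written JSON string into the same sort of list. It assumes the jsonlite package is installed; the JSON string is made up to mimic a few of the fields shown above.

```r
library(jsonlite)

# A hand-written JSON string with a few of the fields shown above.
json_str <- '{"hour": 14, "minute": 53, "timeZone": "Europe/Amsterdam"}'

# Parse into a plain R list of name/value pairs.
obj <- fromJSON(json_str, simplifyVector = FALSE)

obj$minute         # dollar sign access
obj[["timeZone"]]  # equivalent double-bracket access
names(obj)         # all of the available names
```

The double-bracket form is handy when the name of the element is stored in a variable rather than typed out directly.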
While the term “API” usually describes access points designed specifically for programs to access data, we can use the same ideas to scrape data from a website. Your browser can be thought of as a program that uses an API to access data in the form of HTML, CSS, and JavaScript. Take a moment to look at the CNN Lite website. We’ll try to grab data from this page from within R.
The “API” here is simple; it has no query parameters. The data that
is returned is in a markup language called HTML, so we change the type
of data that is returned by the content
function:
url_str <- modify_url("https://lite.cnn.com/")
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "text/html", encoding = "UTF-8")
The object returned is a special type of R class that handles XML/HTML data.
obj
## {html_document}
## <html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/cnnlite-v1@published">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=U ...
## [2] <body class="cnn">\n <header class="header--lite"><a href="/" cl ...
We’ll cover more details over the next few classes about how to use XML/HTML objects. Here we’ll just dive into some examples. To start, I’ll use the xml_find_all function to find links (the tag “a”) that are inside list items (the tag “li”). This returns a link to each of the 100 stories on the front page of the CNN Lite website.
xml_find_all(obj, "..//li/a")
## {xml_nodeset (100)}
## [1] <a href="/2023/03/24/entertainment/gwyneth-paltrow-ski-collision-tri ...
## [2] <a href="/2023/03/24/politics/house-vote-parents-bill-of-rights-act" ...
## [3] <a href="/2023/03/24/health/eye-early-alzheimers-diagnosis-wellness" ...
## [4] <a href="/2023/03/24/opinions/indictment-might-end-up-helping-trump- ...
## [5] <a href="/2023/03/24/politics/evan-corcoran-testimony-documents-prob ...
## [6] <a href="/2023/03/24/economy/federal-reserve-unemployment-projection ...
## [7] <a href="/2023/03/24/health/colorectal-cancer-young-age-mystery">\n ...
## [8] <a href="/travel/article/airpods-tracked-down">\n This woman ...
## [9] <a href="/2023/03/24/health/eye-infection-patients">\n ‘Every ...
## [10] <a href="/2023/03/24/africa/paul-rusesabagina-released-rwanda-intl"> ...
## [11] <a href="/2023/03/24/us/baltimore-county-truck-crash">\n Truc ...
## [12] <a href="/2023/03/24/entertainment/yellowjackets-season-2-review">\n ...
## [13] <a href="/2023/03/24/tech/china-opposes-tiktok-sale-approval-needed- ...
## [14] <a href="/2023/03/24/us/denver-colorado-school-shooting-friday">\n ...
## [15] <a href="/2023/03/24/cars/eu-combustion-engine-debate-climate-intl"> ...
## [16] <a href="/2023/03/24/tech/twitter-verified-checkmarks">\n Pay ...
## [17] <a href="/2023/03/24/sport/sweet-16-fau-tennessee-upset-march-madnes ...
## [18] <a href="/2023/03/24/app-news-section/videos-of-the-week-mobile-marc ...
## [19] <a href="/2023/03/24/middleeast/israel-netanyahu-judicial-overhaul-i ...
## [20] <a href="/2023/03/24/sport/aurelien-sanchez-barkley-marathons-ultrar ...
## ...
We can extract the links to these pages using the function xml_attr and grabbing the “href” attribute.
temp <- xml_find_all(obj, "..//li/a")
links <- xml_attr(temp, "href")
head(links)
## [1] "/2023/03/24/entertainment/gwyneth-paltrow-ski-collision-trial-friday"
## [2] "/2023/03/24/politics/house-vote-parents-bill-of-rights-act"
## [3] "/2023/03/24/health/eye-early-alzheimers-diagnosis-wellness"
## [4] "/2023/03/24/opinions/indictment-might-end-up-helping-trump-zelizer"
## [5] "/2023/03/24/politics/evan-corcoran-testimony-documents-probe"
## [6] "/2023/03/24/economy/federal-reserve-unemployment-projections"
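The same two steps can be tried on a tiny page that you write yourself, which makes it easier to see exactly what the selection picks up. This sketch assumes the xml2 package; the HTML is made up, and it uses the more common “.//” prefix in the search pattern.

```r
library(xml2)

# A tiny hand-written page standing in for the real front page.
html_str <- '<html><body><ul>
  <li><a href="/story-one">First story</a></li>
  <li><a href="/story-two">Second story</a></li>
</ul><p><a href="/about">About</a></p></body></html>'
page <- read_html(html_str)

# Select only the links that sit directly inside a list item.
nodes <- xml_find_all(page, ".//li/a")

# Pull out the value of the href attribute from each selected link.
xml_attr(nodes, "href")
```

Notice that the “About” link is skipped because it sits inside a paragraph rather than a list item.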
Once we have the links, we can grab the actual content of a specific link. For example, here we grab the first page:
url_str <- modify_url(paste0("https://lite.cnn.com/", links[1]))
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "text/html", encoding = "UTF-8")
Once again, the page is an HTML document.
obj
## {html_document}
## <html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/entertainment-article-v1@published">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=U ...
## [2] <body class="entertainment">\n <div>\n <header clas ...
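One small detail about building the address: each link already starts with a slash, so pasting it onto a base URL that ends in a slash gives an address with a doubled slash. The request above evidently still succeeds, but a tidier option is to let a function resolve the relative link for us. Here is a sketch using url_absolute from the xml2 package, with made-up link values.

```r
library(xml2)

# Relative links as returned by xml_attr() (made-up values).
links <- c("/2023/03/24/some-story", "/travel/article/another-story")

# Resolve each relative link against the site's base URL.
full_urls <- url_absolute(links, "https://lite.cnn.com/")
full_urls
```

This also does the right thing if a page happens to contain links that are already complete addresses.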
From this, we can get each paragraph in the document:
xml_find_all(obj, "..//p[@class='paragraph--lite']")
## {xml_nodeset (48)}
## [1] <p class="paragraph--lite">\n Actress<a href="https://www.cnn.com/e ...
## [2] <p class="paragraph--lite">\n The actress and businesswoman has bee ...
## [3] <p class="paragraph--lite">\n Sanderson has accused Paltrow of cras ...
## [4] <p class="paragraph--lite">\n Paltrow <a href="https://www.cnn.com/ ...
## [5] <p class="paragraph--lite">\n The two have been in a legal battle f ...
## [6] <p class="paragraph--lite">\n Paltrow took the stand just before 3p ...
## [7] <p class="paragraph--lite">\n The Goop founder said on the day of t ...
## [8] <p class="paragraph--lite">\n The trip, Paltrow said, was the first ...
## [9] <p class="paragraph--lite">\n The collision happened on the first d ...
## [10] <p class="paragraph--lite">\n All of the children, who are around t ...
## [11] <p class="paragraph--lite">\n When asked if she was a good tipper, ...
## [12] <p class="paragraph--lite">\n Paltrow then repeated her assertion t ...
## [13] <p class="paragraph--lite">\n “He struck me in the back, yes, that ...
## [14] <p class="paragraph--lite">\n At one point, VanOrman had hoped to u ...
## [15] <p class="paragraph--lite">\n Paltrow testified that she was trying ...
## [16] <p class="paragraph--lite">\n She testified Friday that two skis ca ...
## [17] <p class="paragraph--lite">\n VanOrman walked around the courtroom ...
## [18] <p class="paragraph--lite">\n Paltrow said that at one point during ...
## [19] <p class="paragraph--lite">\n They both came crashing down together ...
## [20] <p class="paragraph--lite">\n Paltrow said she did not ask about th ...
## ...
And with a little bit of cleaning, we have the full text of the article in R:
text <- xml_find_all(obj, "..//p[@class='paragraph--lite']")
text <- xml_text(text)
text <- stri_trim(text)
head(text)
## [1] "Actress Gwyneth Paltrow took the stand to testify on Friday in a Utah trial over a 2016 snow skiing accident, painting a picture of her version of events relating to the incident over roughly two hours of testimony."
## [2] "The actress and businesswoman has been present in the courtroom since the trial began on Tuesday when lawyers representing Paltrow and Terry Sanderson, a 76-year-old retired optometrist, presented their opening statements to a seated jury."
## [3] "Sanderson has accused Paltrow of crashing into him and causing him lasting injuries and brain damage while they were both skiing on a beginner’s run on a Utah mountain in February of 2016. Sanderson also claims Paltrow and her ski instructor skied away after the incident without getting him medical care."
## [4] "Paltrow filed a countersuit against Sanderson in 2019 claiming that he skied into her."
## [5] "The two have been in a legal battle for seven years."
## [6] "Paltrow took the stand just before 3pm local time and was questioned by Sanderson’s attorney, Kristin A. VanOrman."
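To see what each of the three cleaning steps contributes, here is the same pipeline on a small hand-written page. It assumes the xml2 and stringi packages; the HTML is made up, with one paragraph of a different class to show what the filter leaves out.

```r
library(xml2)
library(stringi)

# A made-up article page with extra whitespace and one unrelated paragraph.
html_str <- '<html><body>
  <p class="paragraph--lite">
    First paragraph of the story.
  </p>
  <p class="footer">Not part of the story.</p>
  <p class="paragraph--lite">
    Second paragraph, with a <a href="/x">link</a> inside.
  </p>
</body></html>'
page <- read_html(html_str)

# Keep only paragraphs with the article class.
nodes <- xml_find_all(page, ".//p[@class='paragraph--lite']")

# Drop the tags, keeping the text (including text inside nested links).
text <- xml_text(nodes)

# Remove the leading and trailing whitespace left over from the markup.
text <- stri_trim(text)
text
```

The call to xml_text keeps the words inside the nested link, which is why the linked names in the real article still appear in the cleaned text.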
We can use a for loop to cycle over each of the links and store the text from all 100 stories. Let’s try to do that now!
text_all <- rep("", length(links))
for (j in seq_along(links))
{
  # download (or load from the cache) the page for story j
  url_str <- modify_url(paste0("https://lite.cnn.com/", links[j]))
  res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
  obj <- content(res, type = "text/html", encoding = "UTF-8")

  # extract and clean the paragraphs of the story
  text <- xml_find_all(obj, "..//p[@class='paragraph--lite']")
  text <- xml_text(text)
  text <- stri_trim(text)

  # store the whole story as a single string
  text_all[j] <- paste0(text, collapse = " ")
}
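When looping over many pages, a single failed download will stop the whole loop. One way to guard against this is to wrap the work for each page in tryCatch and fall back to an empty string. The sketch below shows the pattern on made-up data; the helper functions fetch_page and extract_story_text are hypothetical stand-ins, with fetch_page pretending that a missing value is a failed download.

```r
library(xml2)
library(stringi)

# Hypothetical stand-in for a download: NA plays the role of a failure.
fetch_page <- function(html) {
  if (is.na(html)) stop("download failed")
  read_html(html)
}

# Hypothetical helper: turn one parsed page into a single string of text.
extract_story_text <- function(page) {
  nodes <- xml_find_all(page, ".//p[@class='paragraph--lite']")
  paste0(stri_trim(xml_text(nodes)), collapse = " ")
}

# Made-up pages; the second one cannot be fetched.
pages <- c(
  '<html><body><p class="paragraph--lite"> One. </p><p class="paragraph--lite"> Two. </p></body></html>',
  NA,
  '<html><body><p class="paragraph--lite"> Three. </p></body></html>'
)

text_all <- rep("", length(pages))
for (j in seq_along(pages))
{
  # If anything goes wrong, keep the empty string and move on.
  text_all[j] <- tryCatch(
    extract_story_text(fetch_page(pages[j])),
    error = function(e) ""
  )
}
text_all
```

Any story that fails ends up as an empty string, which can then be filtered out before running the annotations.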
Now, we can create a dataset that looks a lot like the
docs
tables that we have been working with all
semester:
docs <- tibble(
  doc_id = sprintf("doc%04d", seq_along(links)),
  train_id = "train",
  text = text_all
)
docs
## # A tibble: 100 × 3
## doc_id train_id text
## <chr> <chr> <chr>
## 1 doc0001 train Actress Gwyneth Paltrow took the stand to testify on Fr…
## 2 doc0002 train The House voted Friday to pass a controversial bill tha…
## 3 doc0003 train The eyes are more than a window to the soul — they’re a…
## 4 doc0004 train The news of a potential indictment would likely derail …
## 5 doc0005 train Evan Corcoran, Donald Trump’s primary defense attorney,…
## 6 doc0006 train One of the biggest unknowns since the Federal Reserve s…
## 7 doc0007 train Nikki Lawson received the shock of her life at age 35. …
## 8 doc0008 train We’ve had people tracking their bags when airlines can’…
## 9 doc0009 train Renee Martray of South Carolina has severe and permanen…
## 10 doc0010 train Paul Rusesabagina, who inspired the Hollywood film “Hot…
## # … with 90 more rows
Now, we need to create the anno
table from the
docs
table. In the past I have given this to you, but this
time you will have to make it yourself. The algorithm I used requires
setting up Python, which is more trouble than it is worth for one class
project. Let’s instead use a C-based algorithm that requires no
additional setup.
Here is the code to run the annotations over the documents. We will also remove any empty documents in the process, which can cause bugs later on.
library(cleanNLP)
cnlp_init_udpipe("english")

docs <- filter(docs, stringi::stri_length(text) > 0)
anno <- cnlp_annotate(docs)$token
## Processed document 10 of 100
## Processed document 20 of 100
## Processed document 30 of 100
## Processed document 40 of 100
## Processed document 50 of 100
## Processed document 60 of 100
## Processed document 70 of 100
## Processed document 80 of 100
## Processed document 90 of 100
## Processed document 100 of 100
The annotation process takes some time, but shouldn’t be too bad with only 100 short documents.
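The filtering step is easy to check on a small made-up table before running it on the real data. This sketch assumes the dplyr and stringi packages and uses three invented documents, one of which is empty.

```r
library(dplyr)
library(stringi)

# A made-up docs table in which the second document has no text.
docs <- tibble(
  doc_id = sprintf("doc%04d", 1:3),
  train_id = "train",
  text = c("Some text.", "", "More text.")
)

# Keep only the documents that contain at least one character.
docs <- filter(docs, stri_length(text) > 0)
docs
```

Note that the remaining documents keep their original identifiers, so there can be gaps in the doc_id numbering after filtering.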
Now, we can use all of the functions we have used in class on the data. There is no straightforward predictive task, but we can use any of the unsupervised algorithms to study the data. For example, here are the words with the highest G-scores associated with each news article:
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(docs, label_var = "doc_id") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 6L) %>%
  summarize(terms = paste0(token, collapse = "; ")) %>%
  getElement("terms")
## [1] "collision; ski; testify; accident; damage; stand"
## [2] "school; parent; classroom; bill; vote; child"
## [3] "disease; study; cell; brain; eye; decline"
## [4] "indictment; Trump; candidate; supporter; voter; persona"
## [5] "jury; document; search; prosecutor; probe; subpoena"
## [6] "banking; stability; rate; economist; crisis; sector"
## [7] "cancer; patient; adult; factor; weight; rise"
## [8] "plane; employee; detective; husband; airport; track"
## [9] "eye; infection; outbreak; vision; tear; drop"
## [10] "official; release; government; aide; sentence; family"
## [11] "fire; crash; department; explosion; fuel; diesel"
## [12] "season; character; mystery; other; alive; asham"
## [13] "sale; algorithm; technology; recommendation; accordance; force"
## [14] "school; police; officer; teacher; student; shoot"
## [15] "climate; fuel; car; fleet; exception; allow"
## [16] "tweet; account; program; user; revenue; company"
## [17] "game; victory; edge; half; play; man"
## [18] "pilot; history; briefe; caring; celebrate; play"
## [19] "conflict; interest; ally; violate; speech; nation"
## [20] "race; course; finish; sleep; mile; loops"
## [21] "attack; coalition; troops; target; carry; drone"
## [22] "strike; traveler; protest; visit; destination; disruption"
## [23] "murder; identify; killer; investigator; find; know"
## [24] "rate; inflation; home; buyer; slow; yield"
## [25] "protester; pension; government; reform; police; retirement"
## [26] "child; diagnose; identification; prevalence; trend; detection"
## [27] "visit; postpone; pension; reform; confirm; travel"
## [28] "opposition; democracy; party; conviction; disqualify; low"
## [29] "bank; banks; investor; bond; market; index"
## [30] "app; user; filter; platform; video; difference"
## [31] "city; force; deport; troops; town; region"
## [32] "data; security; app; information; collect; user"
## [33] "storm; expect; flood; rain; watch; wind"
## [34] "jury; attorney; source; witness; prosecutor; hear"
## [35] "gear; film; production; show; series; length"
## [36] "floor; pavement; shoe; visitor; walk; ceremony"
## [37] "authority; check; operate; company; target; background"
## [38] "leak; water; compound; contain; monitor; milligram"
## [39] "wave; art; copy; today; produce; sell"
## [40] "record; win; penalty; score; goal; glory"
## [41] "team; pride; player; night; community; celebration"
## [42] "testify; ski; Paltrow; Plaintiff; injury; videotaped"
## [43] "flash; flooding; flood; rain; soil; water"
## [44] "woman; transgender; athlete; tran; sport; advantage"
## [45] "abuse; detention; survivor; rights; prison; camp"
## [46] "manufacturing; cabinet; highlight; infrastructure; stop; week"
## [47] "eat; disorder; community; faith; health; illness"
## [48] "test; missile; cruise; drone; weapon; analyst"
## [49] "water; claim; statement; presence; dispute; operation"
## [50] "minister; overhaul; declare; law; settlement; sit"
## [51] "ubs; bank; credit; deal; Suisse; franc"
## [52] "arm; war; relationship; representative; weapon; defense"
## [53] "zoo; officer; fire; arrive; animal; escape"
## [54] "vehicle; driver; zone; police; crash; state"
## [55] "algorithm; technology; sale; regulator; data; recommendation"
## [56] "lawmaker; million; firm; own; hearing; value"
## [57] "search; jury; property; document; subpoena; Trump"
## [58] "season; sport; team; champion; win; ownership"
## [59] "collapse; arrest; fraud; capital; charge; believe"
## [60] "terrorism; terrorist; message; fighter; prosecutor; bureau"
## [61] "commission; officer; spokesman; police; recommend; violation"
## [62] "worker; district; union; student; school; wage"
## [63] "lawmaker; questioning; question; pose; server; answer"
## [64] "interview; senator; campaign; letter; aide; solicit"
## [65] "motion; charge; bond; miss; document; case"
## [66] "surname; court; election; defamation; leader; speech"
## [67] "flight; airline; pilot; crew; aircraft; assistance"
## [68] "border; crossing; agreement; entry; migrants; port"
## [69] "migration; asylum; country; pose; continue; gang"
## [70] "care; gender; affirme; therapy; suicide; treatment"
## [71] "care; affirme; rule; gender; bans; minor"
## [72] "school; student; shooting; gun; board; police"
## [73] "request; review; letter; information; committee; investigation"
## [74] "table; mom; think; baby; joke; feel"
## [75] "abortion; debate; decide; endorsement; state; opponent"
## [76] "veto; overturn; investment; rule; resolution; retirement"
## [77] "hour; union; contract; pay; raise; cast"
## [78] "season; tv; streaming; show; finish; story"
## [79] "traffic; air; staff; airport; memo; airline"
## [80] "police; shoot; teenager; condition; hospital; victim"
## [81] "plaintiff; court; athlete; tran; policy; suit"
## [82] "episode; character; backlash; casting; define; helmet"
## [83] "bill; media; teens; legislation; parent; platform"
## [84] "expansion; state; health; expand; hospital; benefit"
## [85] "officer; van; drag; fail; intervene; allege"
## [86] "data; collect; concern; hearing; security; video"
## [87] "executive; election; lawyer; rig; theories; judge"
## [88] "dispatcher; child; rescue; boy; crawl; responde"
## [89] "incidence; case; people; report; pandemic; end"
## [90] "player; game; appearance; partnership; speed; write"
## [91] "son; parent; opinion; gun; text; contemplate"
## [92] "wait; budget; category; customer; time; improve"
## [93] "retirement; age; worker; protest; benefit; raise"
## [94] "mystery; series; viewer; show; concrete; Yellowjackets"
## [95] "drug; milligram; prescription; death; keep; contain"
## [96] "committee; subpoena; chilling; item; share; produce"
## [97] "firefighter; train; chemical; car; information; accident"
## [98] "conspiracy; lie; news; network; lawsuit; spread"
## [99] "memo; threat; report; school; enforcement; claim"
## [100] "prosecut; serve; offense; end; corruption; charge"
Can you get a sense of the article topics by just looking at the top words?
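These notes do not spell out how dsst_metrics computes a G-score. A common definition is the log-likelihood ratio statistic on a two-by-two table of counts, and the sketch below assumes that definition; the course function may differ in its details. The function name g_score and the counts are made up for illustration.

```r
# Hypothetical sketch of a G-score for one word and one document.
#   a: times the word appears in the document
#   b: times the word appears in all other documents
#   c: count of all other words in the document
#   d: count of all other words in all other documents
g_score <- function(a, b, c, d) {
  observed <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)

  # Counts we would expect if the word were spread evenly across documents.
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

  # Cells with a zero count contribute nothing to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# A word used at the same rate everywhere carries no signal.
g_score(10, 10, 90, 90)

# A word used far more often in this document gets a large score.
g_score(30, 5, 70, 95)
```

Under this definition the score is large whenever a word is unusually frequent or unusually rare in a document, which is why the code above also filters to the words whose count is greater than expected.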