Today, we are going to run through a number of different examples of creating data sets through API calls. For the most part, these will make queries that return JSON data and require some form of iteration to process them. I will show you two in the notes and two others in the notebook.
If you have no prior programming experience, this may be a bit overwhelming, but that’s okay. I will make sure you have all of the API interface code you actually need for Project 4 already outlined for you.
Let's start by using an API provided by Wikipedia. We will use this more in the project, but here we focus just on using the API to see how many times a Wikipedia page has been viewed in the past month.
We begin by defining a base URL using the protocol, authority, path, and query parameters.
url_base <- modify_url(
  "https://en.wikipedia.org/w/api.php",
  query = list(action = "query", format = "json", prop = "pageviews")
)
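Printing url_base is a good way to check that modify_url has assembled the pieces into a single address; rebuilding it here as a self-contained check:

```r
library(httr)

# Rebuild the base URL so we can inspect the assembled query string
url_base <- modify_url(
  "https://en.wikipedia.org/w/api.php",
  query = list(action = "query", format = "json", prop = "pageviews")
)
url_base
```

This should print the authority and path followed by the query string `?action=query&format=json&prop=pageviews`.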
To fetch a particular page, we need to add an additional parameter called titles containing the title of the page we want to grab. We can then retrieve and parse the HTTP response as JSON.
url_str <- modify_url(url_base, query = list(titles = "ChatGPT"))
res <- dsst_cache_get(url_str, cache_dir = "cache", force = FALSE)
obj <- content(res, type = "application/json")
The shape of obj is fairly complex. We could walk through it using the names function, the dollar sign, and the bracket operators. In RStudio, you can look at the object easily by clicking on it in the Environment panel. With some experimentation, you can see that the core of the data we want is nested here:
pviews <- obj$query$pages[[1]]$pageviews
pviews
## $`2023-01-24`
## [1] 246519
##
## $`2023-01-25`
## [1] 238953
##
## $`2023-01-26`
## [1] 240126
##
## $`2023-01-27`
## [1] 239290
##
## $`2023-01-28`
## [1] 215210
##
## $`2023-01-29`
## [1] 220959
##
## $`2023-01-30`
## [1] 283857
##
## $`2023-01-31`
## [1] 278906
##
## $`2023-02-01`
## [1] 257708
##
## $`2023-02-02`
## [1] 267112
##
## $`2023-02-03`
## [1] 264945
##
## $`2023-02-04`
## [1] 211780
##
## $`2023-02-05`
## [1] 209862
##
## $`2023-02-06`
## [1] 249893
##
## $`2023-02-07`
## [1] 301250
##
## $`2023-02-08`
## [1] 299810
##
## $`2023-02-09`
## [1] 310804
##
## $`2023-02-10`
## [1] 255094
##
## $`2023-02-11`
## [1] 241040
##
## $`2023-02-12`
## [1] 227644
##
## $`2023-02-13`
## [1] 257447
##
## $`2023-02-14`
## [1] 204424
##
## $`2023-02-15`
## [1] 234607
##
## $`2023-02-16`
## [1] 212979
##
## $`2023-02-17`
## [1] 176821
##
## $`2023-02-18`
## [1] 147754
##
## $`2023-02-19`
## [1] 149091
##
## $`2023-02-20`
## [1] 164983
##
## $`2023-02-21`
## [1] 187112
##
## $`2023-02-22`
## [1] 183686
##
## $`2023-02-23`
## [1] 169804
##
## $`2023-02-24`
## [1] 139778
##
## $`2023-02-25`
## [1] 126296
##
## $`2023-02-26`
## [1] 133353
##
## $`2023-02-27`
## [1] 193268
##
## $`2023-02-28`
## [1] 169112
##
## $`2023-03-01`
## [1] 164781
##
## $`2023-03-02`
## [1] 175421
##
## $`2023-03-03`
## [1] 168276
##
## $`2023-03-04`
## [1] 149902
##
## $`2023-03-05`
## [1] 158610
##
## $`2023-03-06`
## [1] 196390
##
## $`2023-03-07`
## [1] 204225
##
## $`2023-03-08`
## [1] 207799
##
## $`2023-03-09`
## [1] 204618
##
## $`2023-03-10`
## [1] 185575
##
## $`2023-03-11`
## [1] 154763
##
## $`2023-03-12`
## [1] 159093
##
## $`2023-03-13`
## [1] 190348
##
## $`2023-03-14`
## [1] 198567
##
## $`2023-03-15`
## [1] 238583
##
## $`2023-03-16`
## [1] 223578
##
## $`2023-03-17`
## [1] 204472
##
## $`2023-03-18`
## [1] 157222
##
## $`2023-03-19`
## [1] 172500
##
## $`2023-03-20`
## [1] 301553
##
## $`2023-03-21`
## [1] 283419
##
## $`2023-03-22`
## [1] 284092
##
## $`2023-03-23`
## [1] 280056
##
## $`2023-03-24`
## NULL
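To make the names/dollar-sign/bracket exploration mentioned above concrete, here is a toy illustration on a small mock object with the same nesting (the field names are stand-ins; the real obj has many more fields):

```r
# A hypothetical stand-in for the parsed JSON, with the same nesting
mock <- list(
  batchcomplete = "",
  query = list(
    pages = list(
      list(
        title = "ChatGPT",
        pageviews = list(`2023-01-24` = 246519, `2023-01-25` = 238953)
      )
    )
  )
)

names(mock)                      # "batchcomplete" "query"
names(mock$query)                # "pages"
names(mock$query$pages[[1]])     # "title" "pageviews"
mock$query$pages[[1]]$pageviews  # the named list of daily counts
```

Working outward one level at a time with names() is usually faster than guessing the whole path in one go.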
We can turn this into a rectangular data set by using the functions found in the slides.
dt <- tibble(
  page = "ChatGPT",
  date = ymd(names(pviews)),
  views = map_int(pviews, ~ dsst_null_to_na(..1))
)
dt
## # A tibble: 60 × 3
##    page    date       views
##    <chr>   <date>     <int>
##  1 ChatGPT 2023-01-24 246519
##  2 ChatGPT 2023-01-25 238953
##  3 ChatGPT 2023-01-26 240126
##  4 ChatGPT 2023-01-27 239290
##  5 ChatGPT 2023-01-28 215210
##  6 ChatGPT 2023-01-29 220959
##  7 ChatGPT 2023-01-30 283857
##  8 ChatGPT 2023-01-31 278906
##  9 ChatGPT 2023-02-01 257708
## 10 ChatGPT 2023-02-02 267112
## # … with 50 more rows
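The helper dsst_null_to_na comes from the course package, so I have not reproduced its source here, but its job should be equivalent to this sketch: map_int needs every element to be an integer, and the most recent day's count comes back as NULL.

```r
# A minimal stand-in for the course helper: turn a NULL (a day with no
# recorded count yet) into an integer missing value
null_to_na <- function(x) {
  if (is.null(x)) NA_integer_ else as.integer(x)
}

null_to_na(NULL)    # NA
null_to_na(246519)  # 246519
```

This is why the final row of dt has an NA rather than causing map_int to fail.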
We can then plot the data and see if there are any interesting patterns:
dt %>%
  filter(!is.na(views)) %>%
  mutate(wt = if_else(wday(date, week_start = 1) > 5, "weekend", "weekday")) %>%
  ggplot(aes(date, views)) +
    geom_line() +
    geom_point(aes(color = wt), size = 2, show.legend = FALSE) +
    scale_x_date(
      date_breaks = "1 week",
      date_labels = "%m/%d",
      date_minor_breaks = "1 week"
    ) +
    scale_color_viridis_d(begin = 0.2, end = 0.7) +
    labs(x = "Date", y = "Views") +
    dsst_tufte()
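The weekend flag relies on lubridate's wday with week_start = 1, which numbers Monday as 1 through Sunday as 7, so values above 5 are Saturdays or Sundays:

```r
library(lubridate)

# 2023-01-28 was a Saturday; with Monday as day 1, it is day 6
wday(ymd("2023-01-28"), week_start = 1)  # 6

# 2023-01-30 was a Monday; it is day 1
wday(ymd("2023-01-30"), week_start = 1)  # 1
```

Without week_start = 1, wday defaults to numbering from Sunday, and the comparison `> 5` would flag the wrong days.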
As a second task for the notes, let’s create a data set from the API for the popular web comic XKCD. The API has a slightly different structure because the information about the kind of data we want to grab is stored in the path rather than the query parameters. To get the URL for the 10th comic from XKCD, for example, we need this code:
i <- 10
url_str <- modify_url(sprintf("https://xkcd.com/%d/info.0.json", i))
The following code creates a data set for the first 25 webcomics with the year, month, day, safe title, and transcript of each comic. Notice that I am using some new code to store the data more easily within each iteration of the loop.
n <- 25
output <- vector("list", length = n)

for (i in seq_len(n)) {
  url_str <- modify_url(sprintf("https://xkcd.com/%d/info.0.json", i))
  res <- dsst_cache_get(url_str, cache_dir = "cache")
  obj <- content(res, type = "application/json")

  output[[i]] <- tibble(
    year = obj$year,
    month = obj$month,
    day = obj$day,
    safe_title = obj$safe_title,
    transcript = obj$transcript
  )
}

output <- bind_rows(output)
output
## # A tibble: 25 × 5
## year month day safe_title trans…¹
## <chr> <chr> <chr> <chr> <chr>
## 1 2006 1 1 Barrel - Part 1 "[[A b…
## 2 2006 1 1 Petit Trees (sketch) "[[Two…
## 3 2006 1 1 Island (sketch) "[[A s…
## 4 2006 1 1 Landscape (sketch) "[[A s…
## 5 2006 1 1 Blown apart "[[A b…
## 6 2006 1 1 Irony "Narra…
## 7 2006 1 1 Girl sleeping (Sketch -- 11th grade Spanish cl… "[[Gir…
## 8 2006 1 1 Red spiders "[[Man…
## 9 2006 1 1 Serenity is coming out tomorrow "[[Sev…
## 10 2006 1 1 Pi Equals "Pi = …
## # … with 15 more rows, and abbreviated variable name ¹transcript
Note that it should be trivial to adjust your code to get all of the comics as long as you know the number of the most recent comic. We are stopping at 25 to reduce the load on xkcd's servers and to avoid a long runtime.
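If you did want everything, xkcd's published API describes a fixed URL with no comic number in the path that returns the metadata for the most recent comic, including (I am assuming here based on that description) a num field with its number. A sketch, using the same helpers as above:

```r
# The latest comic's metadata lives at a fixed path with no number
url_latest <- "https://xkcd.com/info.0.json"

# Fetching and parsing as before should yield the latest comic number,
# which could replace the hard-coded 25 as the loop bound:
#   res <- dsst_cache_get(url_latest, cache_dir = "cache")
#   n   <- content(res, type = "application/json")$num
```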