Creating Text Visualizations with Wikipedia Data

Taylor Arnold

This document demonstrates version 3 of the cleanNLP package, now available on CRAN.

Grabbing the data

We start by using the MediaWiki API to grab page data from Wikipedia. We will wrap this up into a small function for re-use later, and start by looking at the English page for penguins. The code pulls the rendered page out of the JSON response, parses it as XML, and keeps only the paragraph text within the body of the article.

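library(cleanNLP)
library(stringi)
library(xml2)

grab_wiki <- function(lang, page) {
  # build the MediaWiki API request for the rendered page
  url <- sprintf(
    "https://%s.wikipedia.org/w/api.php?action=parse&format=json&page=%s",
    lang,
    page)
  # the rendered article is stored under parse$text$"*" in the response
  page_json <- jsonlite::fromJSON(url)$parse$text$"*"
  page_xml <- xml2::read_xml(page_json, asText=TRUE)
  # keep only the paragraphs in the body of the article
  page_text <- xml_text(xml_find_all(page_xml, "//div/p"))

  # drop numeric citation markers such as [12], then normalize whitespace
  page_text <- stri_replace_all(page_text, "", regex="\\[[0-9]+\\]")
  page_text <- stri_replace_all(page_text, " ", regex="\n")
  page_text <- stri_replace_all(page_text, " ", regex="[ ]+")
  # discard very short fragments (10 characters or fewer)
  page_text <- page_text[stri_length(page_text) > 10]

  return(page_text)
}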

penguin <- grab_wiki("en", "penguin")
penguin[1:10] # just show the first 10 paragraphs
##  [1] "Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic flightless birds. They live almost exclusively in the Southern Hemisphere, with only one species, the Galapagos penguin, found north of the equator. Highly adapted for life in the water, penguins have countershaded dark and white plumage, and their wings have evolved into flippers. Most penguins feed on krill, fish, squid and other forms of sea life which they catch while swimming underwater. They spend roughly half of their lives on land and the other half in the sea. "                                                                                                                                                                                                                                                                                                               
##  [2] "Although almost all penguin species are native to the Southern Hemisphere, they are not found only in cold climates, such as Antarctica. In fact, only a few species of penguin live so far south. Several species are found in the temperate zone, and one species, the Galápagos penguin, lives near the equator. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [3] "The largest living species is the emperor penguin (Aptenodytes forsteri): on average, adults are about 1.1 m (3 ft 7 in) tall and weigh 35 kg (77 lb). The smallest penguin species is the little blue penguin (Eudyptula minor), also known as the fairy penguin, which stands around 40 cm (16 in) tall and weighs 1 kg (2.2 lb). Among extant penguins, larger penguins inhabit colder regions, while smaller penguins are generally found in temperate or even tropical climates (see also Bergmann's rule). Some prehistoric species attained enormous sizes, becoming as tall or as heavy as an adult human. These were not restricted to Antarctic regions; on the contrary, subantarctic regions harboured high diversity, and at least one giant penguin occurred in a region around 2,000 km south of the equator 35 mya, in a climate decidedly warmer than today.[which?] "
##  [4] "The word penguin first appears in the 16th century as a synonym for great auk. When European explorers discovered what are today known as penguins in the Southern Hemisphere, they noticed their similar appearance to the great auk of the Northern Hemisphere, and named them after this bird, although they are not closely related. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [5] "The etymology of the word penguin is still debated. The English word is not apparently of French, Breton or Spanish origin (the latter two are attributed to the French word pingouin \"auk\"), but first appears in English or Dutch. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
##  [6] "Some dictionaries suggest a derivation from Welsh pen, \"head\" and gwyn, \"white\", including the Oxford English Dictionary, the American Heritage Dictionary, the Century Dictionary and Merriam-Webster, on the basis that the name was originally applied to the great auk, either because it was found on White Head Island (Welsh: Pen Gwyn) in Newfoundland, or because it had white circles around its eyes (though the head was black). "                                                                                                                                                                                                                                                                                                                                                                                                                                     
##  [7] "An alternative etymology links the word to Latin pinguis, which means \"fat\" or \"oil\". Support for this etymology can be found in the alternative Germanic word for penguin, Fettgans or \"fat-goose\", and the related Dutch word vetgans. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
##  [8] "Adult male penguins are called cocks, females hens; a group of penguins on land is a waddle, and a similar group in the water is a raft. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [9] "The number of extant penguin species is debated. Depending on which authority is followed, penguin biodiversity varies between 17 and 20 living species, all in the subfamily Spheniscinae. Some sources consider the white-flippered penguin a separate Eudyptula species, while others treat it as a subspecies of the little penguin; the actual situation seems to be more complicated. Similarly, it is still unclear whether the royal penguin is merely a colour morph of the macaroni penguin. The status of the rockhopper penguins is also unclear. "                                                                                                                                                                                                                                                                                                                        
## [10] "Updated after Marples (1962), Acosta Hospitaleche (2004), and Ksepka et al. (2006). "
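
The cleaning steps inside grab_wiki can be tried out in isolation. The following is a minimal sketch using base R equivalents (gsub and nchar) of the stringi calls; the helper name clean_paragraphs and the toy input are made up for illustration. It also shows why a marker such as [which?] survives in the third paragraph above: the pattern only matches brackets that contain digits.

```r
# Toy paragraphs; the text is invented for illustration.
raw <- c(
  "Penguins are birds.[1] They\nlive   in the south.[23]",
  "warmer than today.[which?]",
  "Short.[4]"
)

# Hypothetical helper mirroring the cleaning steps of grab_wiki,
# written with base R functions instead of stringi.
clean_paragraphs <- function(x) {
  x <- gsub("\\[[0-9]+\\]", "", x)  # drop numeric citation markers
  x <- gsub("\n", " ", x)           # turn newlines into spaces
  x <- gsub("[ ]+", " ", x)         # collapse runs of spaces
  x[nchar(x) > 10]                  # drop very short fragments
}

cleaned <- clean_paragraphs(raw)
print(cleaned)
```

The first element comes back as a single tidy sentence pair, the second is unchanged because its bracket holds no digits, and the third is dropped by the length filter.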

Running the cleanNLP annotation

Next, we run the udpipe annotation backend over the dataset using cleanNLP. Because penguin is a character vector with one element per paragraph, each paragraph will be treated as its own document.

cnlp_init_udpipe()
anno <- cnlp_annotate(penguin, verbose=FALSE)
anno$token
## # A tibble: 5,457 x 11
##    doc_id   sid tid   token token_with_ws lemma upos  xpos  feats
##  *  <int> <int> <chr> <chr> <chr>         <chr> <chr> <chr> <chr>
##  1      1     1 1     Peng… "Penguins "   Peng… NOUN  NNS   Numb…
##  2      1     1 2     (     (             (     PUNCT -LRB- <NA> 
##  3      1     1 3     order "order "      order NOUN  NN    Numb…
##  4      1     1 4     Sphe… Spheniscifor… Sphe… NOUN  NNS   Numb…
##  5      1     1 5     ,     ", "          ,     PUNCT ,     <NA> 
##  6      1     1 6     fami… "family "     fami… NOUN  NN    Numb…
##  7      1     1 7     Sphe… Spheniscidae  Sphe… NOUN  NN    Numb…
##  8      1     1 8     )     ") "          )     PUNCT -RRB- <NA> 
##  9      1     1 9     are   "are "        be    AUX   VBP   Mood…
## 10      1     1 10    a     "a "          a     DET   DT    Defi…
## # … with 5,447 more rows, and 2 more variables: tid_source <chr>,
## #   relation <chr>
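
Since the token table has one row per token and carries a doc_id column, ordinary data frame operations are enough to summarize it by paragraph. The following is a small sketch on a made-up table that only mimics a few of the columns shown above; in practice the same calls would be applied to anno$token.

```r
# Toy token table; the rows are invented for illustration and only
# mimic the doc_id, token, lemma and upos columns of anno$token.
token <- data.frame(
  doc_id = c(1L, 1L, 1L, 2L, 2L),
  token = c("Penguins", "are", "birds", "Penguins", "swim"),
  lemma = c("penguin", "be", "bird", "penguin", "swim"),
  upos = c("NOUN", "AUX", "NOUN", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# number of tokens in each document (here: each paragraph)
tokens_per_doc <- table(token$doc_id)

# most frequent noun lemmas across all documents
nouns <- token[token$upos == "NOUN", ]
noun_counts <- sort(table(nouns$lemma), decreasing = TRUE)

print(tokens_per_doc)
print(noun_counts)
```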

Reconstructing the text

Here, we will show how we can recreate the original text, possibly with additional markings. This can be useful when building text-based visualization pipelines. For example, let’s start by replacing all of the proper nouns with an all-caps version of each word. This is easy because udpipe (and spaCy as well) provides a column called token_with_ws, which holds each token together with any whitespace that follows it:

token <- anno$token
token$new_token <- token$token_with_ws
change_these <- which(token$xpos %in% c("NNP", "NNPS"))
token$new_token[change_these] <- stri_trans_toupper(token$new_token[change_these])

Then, push all of the text back together by paragraph (we use the stri_wrap function to print out the text in a nice format for this document):

paragraphs <- tapply(token$new_token, token$doc_id, paste, collapse="")[1:10]
paragraphs <- stri_wrap(paragraphs, simplify=FALSE, exdent = 1)
cat(unlist(lapply(paragraphs, function(v) c(v, ""))), sep="\n")
## Penguins (order Sphenisciformes, family Spheniscidae) are a group
##  of aquatic flightless birds. They live almost exclusively in the
##  Southern Hemisphere, with only one species, the GALAPAGOS PENGUIN,
##  found north of the equator. Highly adapted for life in the water,
##  penguins have countershaded dark and white plumage, and their wings
##  have evolved into flippers. Most penguins feed on krill, fish,
##  squid and other forms of sea life which they catch while swimming
##  underwater. They spend roughly half of their lives on land and the
##  other half in the sea.
## 
## Although almost all penguin species are native to the Southern
##  Hemisphere, they are not found only in cold climates, such as
##  ANTARCTICA. In fact, only a few species of penguin live so far
##  south. Several species are found in the temperate zone, and one
##  species, the GALÁPAGOS penguin, lives near the equator.
## 
## The largest living species is the emperor penguin (Aptenodytes
##  forsteri): on average, adults are about 1.1 m (3 ft 7 in) tall and
##  weigh 35 kg (77 lb). The smallest penguin species is the little
##  blue penguin (EUDYPTULA MINOR), also known as the fairy penguin,
##  which stands around 40 cm (16 in) tall and weighs 1 kg (2.2 lb).
##  Among extant penguins, larger penguins inhabit colder regions,
##  while smaller penguins are generally found in temperate or even
##  tropical climates (see also BERGMANN's rule). Some prehistoric
##  species attained enormous sizes, becoming as tall or as heavy as
##  an adult human. These were not restricted to Antarctic regions; on
##  the contrary, subantarctic regions harboured high diversity, and at
##  least one giant penguin occurred in a region around 2,000 km south
##  of the equator 35 mya, in a climate decidedly warmer than today.
##  [which?]
## 
## The word penguin first appears in the 16th century as a synonym for
##  great auk. When European explorers discovered what are today known
##  as penguins in the SOUTHERN Hemisphere, they noticed their similar
##  appearance to the great auk of the Northern Hemisphere, and named
##  them after this bird, although they are not closely related.
## 
## The etymology of the word penguin is still debated. The English
##  word is not apparently of FRENCH, BRETON or Spanish origin (the
##  latter two are attributed to the French word pingouin "auk"), but
##  first appears in ENGLISH or DUTCH.
## 
## Some dictionaries suggest a derivation from WELSH PEN, "head"
##  and gwyn, "white", including the OXFORD ENGLISH DICTIONARY, the
##  American Heritage DICTIONARY, the CENTURY DICTIONARY and MERRIAM-
##  WEBSTER, on the basis that the name was originally applied to the
##  great auk, either because it was found on WHITE HEAD ISLAND (Welsh:
##  PEN GWYN) in NEWFOUNDLAND, or because it had white circles around
##  its eyes (though the head was black).
## 
## An alternative etymology links the word to LATIN PINGUIS, which
##  means "fat" or "oil". Support for this etymology can be found in
##  the alternative Germanic word for penguin, Fettgans or "fat-goose",
##  and the related Dutch word vetgans.
## 
## Adult male penguins are called cocks, females hens; a group of
##  penguins on land is a waddle, and a similar group in the water is a
##  raft.
## 
## The number of extant penguin species is debated. Depending on which
##  authority is followed, penguin biodiversity varies between 17 and
##  20 living species, all in the subfamily Spheniscinae. Some sources
##  consider the white-flippered penguin a separate EUDYPTULA species,
##  while others treat it as a subspecies of the little penguin; the
##  actual situation seems to be more complicated. Similarly, it is
##  still unclear whether the royal penguin is merely a colour morph of
##  the macaroni penguin. The status of the rockhopper penguins is also
##  unclear.
## 
## Updated after MARPLES (1962), ACOSTA HOSPITALECHE (2004), and
##  KSEPKA et al. (2006).
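
To see why token_with_ws matters here, compare it with simply joining the token column with spaces. The following sketch uses a made-up token table (the values are illustrative, not taken from the annotation above):

```r
# Toy tokens; the values are invented for illustration.
token <- data.frame(
  doc_id = c(1L, 1L, 1L, 1L, 1L, 1L),
  token = c("Penguins", "(", "order", "Sphenisciformes", ")", "swim"),
  token_with_ws = c("Penguins ", "(", "order ", "Sphenisciformes", ") ", "swim"),
  stringsAsFactors = FALSE
)

# concatenating token_with_ws restores the original spacing
with_ws <- paste(token$token_with_ws, collapse = "")

# joining the bare tokens with spaces puts gaps around punctuation
naive <- paste(token$token, collapse = " ")

print(with_ws)
print(naive)
```

The first string keeps the parentheses attached to their neighbours, while the second adds a space on both sides of each one.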

Writing the text out as HTML or XML instead opens up a lot of interesting visualization and metadata work with this approach. If you have a use case that might be useful to others, please feel free to open a pull request to include your work in the package repository.
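
As one possible starting point for HTML output, the sketch below wraps each proper noun in a span element and each paragraph in a p element. It runs on a made-up token table; the class name propn and the helper escape_html are hypothetical names chosen for illustration and are not part of cleanNLP.

```r
# Toy token table; the rows are invented for illustration and only
# mimic the doc_id, token_with_ws and xpos columns of anno$token.
token <- data.frame(
  doc_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  token_with_ws = c("The ", "Galapagos ", "penguin", ".",
                    "See ", "Antarctica", "."),
  xpos = c("DT", "NNP", "NN", ".", "VB", "NNP", "."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: escape characters that are special in HTML text.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# split each token into the word and its trailing whitespace so that
# the markup wraps only the word itself
word <- sub("\\s+$", "", token$token_with_ws)
ws <- substring(token$token_with_ws, nchar(word) + 1L)

# wrap proper nouns in a span (the class name is an assumption)
is_proper <- token$xpos %in% c("NNP", "NNPS")
html_token <- escape_html(word)
html_token[is_proper] <- sprintf('<span class="propn">%s</span>',
                                 html_token[is_proper])
html_token <- paste0(html_token, ws)

# glue the tokens back together, one p element per document
paragraphs <- vapply(split(html_token, token$doc_id), paste,
                     character(1), collapse = "")
html <- sprintf("<p>%s</p>", paragraphs)
cat(html, sep = "\n")
```

A stylesheet rule for the chosen class can then highlight the proper nouns in a browser without changing the text itself.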