For Project 4, unlike the other projects, you will be constructing your own corpus of documents to work with. These documents will come from the text of Wikipedia articles. You will build the corpus by starting with one or more pages that contain thematic lists of other pages, and then collecting the pages named in the list in question. This project is important in two ways. First, you will see how to build and annotate a corpus from scratch. Second, this data has a different shape than the corpora we have seen so far: a much smaller set of longer documents, more similar to what happened when we selected a different doc_id in the unsupervised learning tasks.
To start, we need to grab the data from a single Wikipedia page that contains the links we want to build a corpus from. Rather than scraping the page, we can do this through the MediaWiki API using the following:
url <- modify_url(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "parse", format = "json", redirects = TRUE,
    page = utils::URLdecode("Data_science")
  )
)
Now, let’s grab the page:
res <- dsst_cache_get(url, cache_dir = "cache")
obj <- content(res, type = "application/json")
As we have seen from the previous notes, the object returned by the API is rather complex. We can get a canonical title of the page with the following:
obj$parse$title
## [1] "Data science"
The actual text of the page is found in the first element of
obj$parse$text
and it’s in HTML format (yes, it’s XML
inside of JSON; a rare combination). We can parse it with the following,
which returns an HTML document as we saw in the previous notes.
tree <- xml2::read_html(obj$parse$text[[1]])
tree
## {html_document}
## <html>
## [1] <body><div class="mw-parser-output">\n<div class="shortdescription no ...
Let’s try to get all of the links in the body of the page by selecting the anchor tags that sit inside paragraph tags:
links <- xml_find_all(tree, xpath = ".//p//a")
links <- xml_attr(links, "href")
head(links)
## [1] "/wiki/Interdisciplinary" "#cite_note-1"
## [3] "/wiki/Statistics" "/wiki/Scientific_computing"
## [5] "/wiki/Scientific_method" "/wiki/Algorithm"
It looks like we need to do some cleaning of the page links, which we can do like this (these rules should work about 99% of the time for cleaning links from Wikipedia):
links <- links[stri_sub(links, 1L, 6L) == "/wiki/"]
links <- links[stri_sub(links, 1L, 16L) != "/wiki/Wikipedia:"]
links <- stri_sub(links, 7L, -1L)
links <- links[!stri_detect(links, fixed = "#")]
links <- unique(links)
links <- tibble(links = links)
links
## # A tibble: 63 × 1
## links
## <chr>
## 1 Interdisciplinary
## 2 Statistics
## 3 Scientific_computing
## 4 Scientific_method
## 5 Algorithm
## 6 Knowledge
## 7 Unstructured_data
## 8 Data_analysis
## 9 Informatics
## 10 Phenomena
## # … with 53 more rows
Then, we want to cycle through the links, grab the text from each, and create a large docs table. Looping through the pages is no more complex than the examples we saw last week. The tricky part is cleaning the HTML data from each page to get usable text. I’ve wrapped some of the custom rules that I have learned for cleaning the data into the dsst_wiki_make_data function.
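To give a sense of what that function does internally, here is a minimal sketch of the kind of loop it wraps, written with the same API calls used above. This is only an illustration: the actual dsst_wiki_make_data function applies many more cleaning rules to the HTML of each page.

docs_list <- vector("list", nrow(links))
for (i in seq_len(nrow(links)))
{
  # Request each linked page through the MediaWiki API (cached locally).
  url <- modify_url(
    "https://en.wikipedia.org/w/api.php",
    query = list(
      action = "parse", format = "json", redirects = TRUE,
      page = links$links[i]
    )
  )
  res <- dsst_cache_get(url, cache_dir = "cache")
  obj <- content(res, type = "application/json")

  # Parse the returned HTML and collapse the paragraph text into one string.
  tree <- xml2::read_html(obj$parse$text[[1]])
  text <- paste(xml_text(xml_find_all(tree, ".//p")), collapse = " ")
  docs_list[[i]] <- tibble(doc_id = obj$parse$title, text = text)
}
docs <- bind_rows(docs_list)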
I’ve also added a few extra cleaning steps that I have only recently learned might be needed:
docs <- dsst_wiki_make_data(links, cache_dir = "cache")
docs <- mutate(docs, doc_id = stri_replace_all(doc_id, "", regex = "<[^>]+>"))
docs <- mutate(docs, text = stri_replace_all(text, " ", regex = "[\n]+"))
docs <- filter(docs, !duplicated(doc_id))
docs <- mutate(docs, train_id = "train")
docs
## # A tibble: 60 × 3
## doc_id text train…¹
## <chr> <chr> <chr>
## 1 Interdisciplinarity "Interdisciplinarity or interdisciplinary … train
## 2 Statistics "Statistics is the discipline that concern… train
## 3 Computational science "In practical use it is typically the appl… train
## 4 Scientific method "The scientific method is an empirical met… train
## 5 Algorithm "In mathematics and computer science an al… train
## 6 Knowledge "Knowledge is a form of awareness or famil… train
## 7 Unstructured data "Unstructured data is information that eit… train
## 8 Data analysis "Finite element Boundary element Lattice B… train
## 9 Informatics "Informatics is the study of computational… train
## 10 Phenomenon "A phenomenon sometimes spelled phaenomeno… train
## # … with 50 more rows, and abbreviated variable name ¹train_id
Now, we need to create the anno
table from the
docs
table as we did in last week’s notes. Here is the code
again.
library(cleanNLP)
cnlp_init_udpipe("english")
docs <- filter(docs, stringi::stri_length(text) > 0)
anno <- cnlp_annotate(docs)$token
The annotation process takes some time. Let’s save the docs and anno tables for next time.
write_csv(docs, file.path("..", "data", "wiki_data_science.csv"))
write_csv(anno, file.path("..", "data", "wiki_data_science.csv.gz"))
The saved results can then be read in and used without having to re-create the data set. We will use this data in several of our upcoming notebooks.
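For example, in a later notebook the two files could be loaded back in with something like the following (assuming the same relative paths as above; read_csv reads the compressed file directly):

# Re-load the saved corpus and annotations in a later session.
docs <- read_csv(file.path("..", "data", "wiki_data_science.csv"))
anno <- read_csv(file.path("..", "data", "wiki_data_science.csv.gz"))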
Let’s look at one other example that illustrates a common difficulty when using the code above. We will start with the “List_of_sovereign_states” page, which lists all of the current, officially recognized countries in the world.
url <- modify_url(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "parse", format = "json", redirects = TRUE,
    page = utils::URLdecode("List_of_sovereign_states")
  )
)
res <- dsst_cache_get(url, cache_dir = "cache")
obj <- content(res, type = "application/json")
The difficulty here is that the links are contained inside a table object. On its own, this would just require using a different value for the xpath argument (.//table//a). However, in this case we only want the links in the first column of the table. This is a bit trickier, but fairly common. To handle it, I wrote a function called dsst_wiki_get_links_table that tries to grab the links from a specific table and a specific column. There are a lot of edge cases, though, and you may need to look at the source code and modify it for your own corpus. Here is the function in action (a rough sketch of the underlying idea follows the output below):
links <- dsst_wiki_get_links_table(obj, table_num = 1L, column_num = 1L)
links
## # A tibble: 207 × 1
## links
## <chr>
## 1 Member_states_of_the_United_Nations
## 2 Afghanistan
## 3 Albania
## 4 Algeria
## 5 Andorra
## 6 Angola
## 7 Antigua_and_Barbuda
## 8 Argentina
## 9 Armenia
## 10 Australia
## # … with 197 more rows
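As promised, here is a rough sketch of the idea behind the function, not its actual implementation: parse the returned HTML, select the anchors sitting in the requested column of the requested table, and clean the hrefs the same way as before. The object name links_sketch is used here only to avoid overwriting the result above.

# Sketch of the idea behind dsst_wiki_get_links_table (table 1, column 1).
tree <- xml2::read_html(obj$parse$text[[1]])
tab <- xml_find_all(tree, ".//table")[[1]]        # the first table on the page
cells <- xml_find_all(tab, ".//tr/td[1]")         # the first cell of each row
hrefs <- xml_attr(xml_find_all(cells, ".//a"), "href")
hrefs <- hrefs[!is.na(hrefs) & stri_sub(hrefs, 1L, 6L) == "/wiki/"]
links_sketch <- tibble(links = unique(stri_sub(hrefs, 7L, -1L)))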
You will often find that the function grabs one or two rows that you do not want. We will remove these manually:
links <- links[-1, ]
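If the unwanted rows are not simply at the top, the same cleanup can be done by value rather than by position. For example, the unwanted first row shown in the output above could be dropped like this:

# Remove an unwanted row by its value rather than its position.
links <- filter(links, links != "Member_states_of_the_United_Nations")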
Once we have the links, we can build the docs table from the pages just as before:
docs <- dsst_wiki_make_data(links, cache_dir = "cache")
docs <- mutate(docs, doc_id = stri_replace_all(doc_id, "", regex = "<[^>]+>"))
docs <- mutate(docs, text = stri_replace_all(text, " ", regex = "[\n]+"))
docs <- filter(docs, !duplicated(doc_id))
docs
## # A tibble: 206 × 2
## doc_id text
## <chr> <chr>
## 1 Afghanistan Afghanistan officially the Islamic Emirate of Afghan…
## 2 Albania Albania al--nee-ə; Albanian Shqipëri or officially t…
## 3 Algeria Algeria officially the Peoples Democratic Republic o…
## 4 Andorra Andorra officially the Principality of Andorra is a …
## 5 Angola Angola ; officially the Republic of Angola is a coun…
## 6 Antigua and Barbuda Antigua and Barbuda is a sovereign island country in…
## 7 Argentina Argentina officially the Argentine Republic is a cou…
## 8 Armenia Armenia officially the Republic of Armenia is a land…
## 9 Australia Australia officially the Commonwealth of Australia i…
## 10 Austria Austria formally the Republic of Austria is a landlo…
## # … with 196 more rows
And then make the annotation and save the results:
library(cleanNLP)
cnlp_init_udpipe("english")
anno <- cnlp_annotate(docs)$token
write_csv(docs, file.path("..", "data", "wiki_list_of_sovereign_states.csv"))
write_csv(anno, file.path("..", "data", "wiki_list_of_sovereign_states_anno.csv.gz"))
We will use both of these datasets in several of our upcoming notebooks.