Class 21: Working with Tokens

2017-11-09

library(readr)
library(ggplot2)
library(dplyr)
library(viridis)

NLP

We have been using basic string processing functions from the stringi package to perform web scraping and data manipulation tasks.

Today, we extend these ideas by using the tokenizers package and two functions I wrote and put into the smodels package to parse the actual text and extract meaningful information from it.

The basic idea is to turn the text into a data frame with one row per word. The basic usage is as follows:

library(smodels)
library(tokenizers)
library(stringi)

input <- c("Penguins are awesome.",
           "Birds that can swim!")
data <- term_list_to_df(tokenize_words(input))
data
## # A tibble: 7 x 2
##      id    token
##   <int>    <chr>
## 1     1 penguins
## 2     1      are
## 3     1  awesome
## 4     2    birds
## 5     2     that
## 6     2      can
## 7     2     swim

Previously, when we scraped data from Wikipedia, we did not do anything with the raw text (though you should have in one of the associated labs). Here is how we could have grabbed the text from the Lagos page:

url <- "https://en.wikipedia.org/wiki/Lagos"
wpage <- data_frame(line = readLines(url))
wpage <- filter(wpage, stri_detect(line, fixed = "<p>"))
wpage <- mutate(wpage,
        line = stri_replace_all(line, "", regex = "<[^>]+>"))

Here is the result (using stri_wrap just for display purposes):

text <- stri_flatten(wpage$line, collapse = " ")
stri_wrap(text)[1:10]
##  [1] "Lagos /ˈleɪɡɒs/[11] (Yoruba: Èkó) is a city in the Nigerian state"  
##  [2] "of Lagos. The city, with its adjoining conurbation, is the largest" 
##  [3] "in Nigeria, as well as on the African continent. It is one of the"  
##  [4] "fastest growing in the world,[12][13][14][15][16][17][18] and also" 
##  [5] "one of the most populous urban agglomerations.[19][20] Lagos is a"  
##  [6] "major financial centre in Africa; the megacity has the highest GDP,"
##  [7] "[4] and also houses one of the largest and busiest ports on the"    
##  [8] "continent.[21][22][23] Lagos initially emerged as a port city which"
##  [9] "originated on a collection of islands, which are contained in the"  
## [10] "present day Local Government Areas (LGAs) of Lagos Island, Eti-"

Let’s now build a small example with just three cities:

urls <- c("https://en.wikipedia.org/wiki/Lagos",
          "https://en.wikipedia.org/wiki/London",
          "https://en.wikipedia.org/wiki/Saint_Petersburg")

df <- data_frame(id = 1:3,
                 city = c("Lagos", "London",
                          "Saint Petersburg"))
df$text <- NA

We then cycle over these URLs to fill in the text column of our dataset:

for (i in 1:3) {
    wpage <- data_frame(line = readLines(urls[i]))
    wpage <- filter(wpage, stri_detect(line, fixed = "<p>"))
    wpage <- mutate(wpage,
        line = stri_replace_all(line, "", regex = "<[^>]+>"))

    df$text[i] <- stri_flatten(wpage$line, collapse = " ")
}
stri_length(df$text)
## [1] 31970 78569 72265

With term_list_to_df, we can extract the tokens from these three pages.

token_data <- term_list_to_df(tokenize_words(df$text))
token_data
## # A tibble: 29,737 x 2
##       id   token
##    <int>   <chr>
##  1     1   lagos
##  2     1 ˈleɪɡɒs
##  3     1      11
##  4     1  yoruba
##  5     1     èkó
##  6     1      is
##  7     1       a
##  8     1    city
##  9     1      in
## 10     1     the
## # ... with 29,727 more rows

The id column can be used to join these tokens with the original city metadata in df:

tokens <- left_join(token_data, select(df, -text), by = "id")
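
As a quick sanity check, we might count the tokens per page after the join; an NA in the city column would signal that some ids did not match:

# number of tokens per page; an NA city here would indicate unmatched ids
count(tokens, city)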

Finding top tokens

Let’s use our grouping functions to find the top words on each city’s page:

tokens %>%
  group_by(city, token) %>%
  count(city) %>%
  group_by(city) %>%
  top_n(n = 3, n)
## # A tibble: 9 x 3
## # Groups:   city [3]
##               city  token     n
##              <chr>  <chr> <int>
## 1            Lagos  lagos   199
## 2            Lagos     of   198
## 3            Lagos    the   417
## 4           London london   411
## 5           London     of   531
## 6           London    the  1053
## 7 Saint Petersburg    and   388
## 8 Saint Petersburg     of   495
## 9 Saint Petersburg    the  1020

Two of the city names pop up, but the other words are just common, boring English terms. We can use a stopword list to remove these. I’ll grab a list from the tidytext package:

library(tidytext)
data(stop_words)
stop_words
## # A tibble: 1,149 x 2
##           word lexicon
##          <chr>   <chr>
##  1           a   SMART
##  2         a's   SMART
##  3        able   SMART
##  4       about   SMART
##  5       above   SMART
##  6   according   SMART
##  7 accordingly   SMART
##  8      across   SMART
##  9    actually   SMART
## 10       after   SMART
## # ... with 1,139 more rows

The anti_join function returns the first dataset with all rows matching the second removed.
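
To see this behaviour in isolation, here is a tiny sketch on made-up data (the words below are purely illustrative):

toy <- data_frame(token = c("the", "penguin", "of", "iceberg"))
banned <- data_frame(word = c("the", "of"))
# keeps only rows of toy whose token does not appear in banned:
# "penguin" and "iceberg" survive
anti_join(toy, banned, by = c("token" = "word"))

Applying the same idea to our tokens and the stop word list: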

tokens %>%
  anti_join(stop_words, by = c("token" = "word")) %>%
  group_by(city, token) %>%
  count(city) %>%
  group_by(city) %>%
  top_n(n = 3, n)
## # A tibble: 9 x 3
## # Groups:   city [3]
##               city      token     n
##              <chr>      <chr> <int>
## 1            Lagos       city    33
## 2            Lagos     island    35
## 3            Lagos      lagos   199
## 4           London        160    70
## 5           London       city   121
## 6           London     london   411
## 7 Saint Petersburg       city   141
## 8 Saint Petersburg petersburg   175
## 9 Saint Petersburg      saint   163

This removes words that are common in a large corpus, but it still leaves words that are not particularly useful in this particular context, such as “city” and the names of the cities themselves. We can build a better list from our data itself:

custom_stop_words <- tokens %>%
  group_by(token) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  top_n(n = 300, n)
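
If you want to peek at what ended up in this list, the first few rows should be the highest-count tokens across all three pages (mostly function words and the city names themselves):

# inspect the most frequent tokens that will be filtered out
head(custom_stop_words, 10)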

This yields much improved results:

tokens %>%
  anti_join(custom_stop_words, by = "token") %>%
  count(city, token) %>%
  group_by(city) %>%
  top_n(n = 3, n)
## # A tibble: 19 x 3
## # Groups:   city [3]
##                city      token     n
##               <chr>      <chr> <int>
##  1            Lagos     africa     9
##  2            Lagos      ikoyi     9
##  3            Lagos   nigerian    10
##  4           London     celtic     8
##  5           London       deer     8
##  6           London       fire     8
##  7           London     forest     8
##  8           London      found    10
##  9           London    gallery     8
## 10           London    kingdom     8
## 11           London      outer     8
## 12           London    outside     8
## 13           London      roman     9
## 14 Saint Petersburg historical    10
## 15 Saint Petersburg  monuments    10
## 16 Saint Petersburg     oblast    11
## 17 Saint Petersburg   prospekt    10
## 18 Saint Petersburg       ussr    10
## 19 Saint Petersburg     winter    11

Note that there are more than three entries for London and Saint Petersburg due to ties.
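
To make the tie behaviour concrete, here is a minimal made-up example: top_n keeps every row tied at the cutoff, so asking for two rows can return three.

# "b" and "c" are tied for second place, so both are kept
data_frame(token = c("a", "b", "c"), n = c(8, 5, 5)) %>%
  top_n(n = 2, n)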

Ideas for using this

Here are some suggestions for how you can use these techniques in the exploratory portion of your second project report:

  • find top word or words for each location and plot on a map
  • count number of words in each page and use this as metadata
  • create a list of interesting words and use semi_join (the opposite of anti_join) to filter only those words that are on this list (see the sketch below)
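
Here is a sketch of the third suggestion; the word list is invented for illustration, so swap in words that matter for your own project:

# a hypothetical list of words we care about
interesting_words <- data_frame(token = c("port", "river", "museum", "winter"))

# semi_join keeps only the tokens that appear in interesting_words
tokens %>%
  semi_join(interesting_words, by = "token") %>%
  count(city, token)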