# Class 21: Working with Tokens

## NLP

We have so far used basic string-processing functions from the stringi package to perform web scraping and data manipulation tasks.

Today, we extend these ideas by using the tokenizers package, along with two functions I wrote and put into the smodels package, to parse raw text and extract meaningful information from it.

The basic idea is to turn raw text into a data frame with one row per word. The basic usage is as follows:
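A minimal sketch of this one-row-per-word idea, using `tokenize_words` from the tokenizers package (the example sentences and the flattening code are mine, not from the course materials):

```r
library(tokenizers)

text <- c("Lagos is the largest city in Nigeria.",
          "London is the capital of England.")

# tokenize_words returns a list with one character vector of words per text;
# by default it lowercases and strips punctuation
tokens <- tokenize_words(text)

# flatten into a data frame with one row per word, keeping a document id
df <- data.frame(
  id   = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
head(df)
```

Note that the `id` column records which document each word came from, which is what lets us join the tokens back to document-level metadata later.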

Previously, when we scraped data from Wikipedia, we did not do anything with the raw text (though you should have in one of the associated labs). Here is how we could have grabbed the text from the Lagos page:
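A sketch using the xml2 package; the exact XPath for Wikipedia's body paragraphs is an assumption and may need adjusting if the page layout changes:

```r
library(xml2)

url <- "https://en.wikipedia.org/wiki/Lagos"
page <- read_html(url)

# pull out the paragraph nodes from the article body and
# collapse them into a single string
paras <- xml_text(xml_find_all(page, "//div[@id='mw-content-text']//p"))
wiki_text <- paste(paras, collapse = " ")
```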

Here is the result (using stri_wrap just for display purposes):
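The display trick looks like this; the sample string here is a stand-in for the scraped page text:

```r
library(stringi)

sample_text <- paste(
  "Lagos is the largest city in Nigeria and one of the",
  "fastest-growing cities in the world.")

# stri_wrap breaks a long string into lines of at most `width` characters,
# which keeps long scraped text readable in the console
cat(stri_wrap(sample_text, width = 40), sep = "\n")
```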

Let’s now build a small example with just three cities:
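A small data frame holding the three cities and their Wikipedia URLs might look like this (the underscore spelling is needed to build valid Wikipedia URLs):

```r
cities <- c("Lagos", "London", "Saint_Petersburg")

df <- data.frame(
  id   = seq_along(cities),
  city = cities,
  url  = paste0("https://en.wikipedia.org/wiki/", cities),
  stringsAsFactors = FALSE
)
df
```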

### Website text

And cycle over these to extract a text column in our dataset:
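One way to write the loop, again using xml2; the XPath is the same assumption as before, and the data frame is rebuilt here so the snippet stands on its own:

```r
library(xml2)

cities <- c("Lagos", "London", "Saint_Petersburg")
df <- data.frame(
  id   = seq_along(cities),
  city = cities,
  url  = paste0("https://en.wikipedia.org/wiki/", cities),
  stringsAsFactors = FALSE
)

# visit each page in turn and store its body text in a new column
df$text <- NA_character_
for (i in seq_len(nrow(df))) {
  page  <- read_html(df$url[i])
  paras <- xml_text(xml_find_all(page, "//div[@id='mw-content-text']//p"))
  df$text[i] <- paste(paras, collapse = " ")
}
```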

With term_list_to_df, we can extract the tokens from these three pages.
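The smodels package is not on CRAN, so as a rough stand-in, assuming `term_list_to_df` takes the list produced by `tokenize_words` and returns a data frame with one row per token plus an `id` column:

```r
library(tokenizers)

# stand-in for smodels::term_list_to_df (assumed behavior): take a list of
# token vectors and return a data frame with one row per token and a
# document id
term_list_to_df <- function(term_list) {
  data.frame(
    id    = rep(seq_along(term_list), lengths(term_list)),
    token = unlist(term_list),
    stringsAsFactors = FALSE
  )
}

token_list <- tokenize_words(c("Lagos is a city", "London is a city"))
tokens <- term_list_to_df(token_list)
head(tokens)
```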

The id column in df can be used to join these tokens back to the original city data:
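The join itself, on a toy version of the two tables:

```r
library(dplyr)

df <- data.frame(id = 1:2, city = c("Lagos", "London"),
                 stringsAsFactors = FALSE)
tokens <- data.frame(id = c(1, 1, 2),
                     token = c("lagos", "city", "london"),
                     stringsAsFactors = FALSE)

# left_join matches each token row to its source page by id
tokens <- left_join(tokens, df, by = "id")
tokens
```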

### Finding top tokens

Let’s use our new grouping function to find the top words in each city page:
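A sketch of the grouping logic on toy data; `top_n` keeps ties, which is why some cities can return more rows than requested:

```r
library(dplyr)

tokens <- data.frame(
  city  = c("Lagos", "Lagos", "Lagos", "London", "London"),
  token = c("the", "the", "lagos", "the", "london"),
  stringsAsFactors = FALSE
)

# count each word within each city, then keep the most frequent words
top_tokens <- tokens %>%
  count(city, token) %>%
  group_by(city) %>%
  top_n(2, n) %>%
  arrange(city, desc(n))
top_tokens
```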

Two of the city names pop up, but the other words are just common, boring English terms. We can use a stopword list to remove these. I'll grab a list here from the tidytext package:
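The pattern looks like this; tidytext's `stop_words` is a data frame with a `word` column, and the tiny hand-rolled list below is a stand-in for it so the example runs on its own:

```r
library(dplyr)

tokens <- data.frame(
  city = c("Lagos", "Lagos", "London"),
  word = c("the", "lagos", "london"),
  stringsAsFactors = FALSE
)

# stand-in for tidytext::stop_words, which also has a `word` column
stop_words <- data.frame(word = c("the", "a", "of", "is"),
                         stringsAsFactors = FALSE)

# keep only token rows whose word is NOT in the stopword list
tokens <- anti_join(tokens, stop_words, by = "word")
tokens
```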

The anti_join function returns all rows of the first dataset that have no match in the second.

This removes words that are common in a large corpus, but it still leaves words that are not particularly useful in this specific context, such as "city" and the names of the cities themselves. We can build a better list from the data itself:
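One way to do this, which is my own sketch rather than the course code: treat any word that appears on every page as a context-specific stopword, since it cannot distinguish the cities from one another.

```r
library(dplyr)

tokens <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3),
  word = c("city", "lagos", "city", "london", "city", "petersburg"),
  stringsAsFactors = FALSE
)

n_pages_total <- n_distinct(tokens$id)

# words that appear on every page carry little information here
custom_stopwords <- tokens %>%
  distinct(id, word) %>%
  count(word, name = "n_pages") %>%
  filter(n_pages == n_pages_total)

tokens <- anti_join(tokens, custom_stopwords, by = "word")
tokens
```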

### Top tokens

Which yields these much improved results:

Note that there are more than three entries for London and Saint Petersburg due to ties.

## Ideas for using this

Here are some suggestions for how you can use these tools in the exploratory portion of your second project report:

• find the top word or words for each location and plot them on a map
• count the number of words in each page and use this as metadata
• create a list of interesting words and use semi_join (the opposite of anti_join) to keep only the words on that list
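The semi_join idea from the last bullet looks like this on toy data; the list of "interesting" words here is hypothetical:

```r
library(dplyr)

tokens <- data.frame(
  city = c("Lagos", "London", "London"),
  word = c("ocean", "river", "bridge"),
  stringsAsFactors = FALSE
)

# hypothetical list of interesting words
interesting <- data.frame(word = c("river", "ocean"),
                          stringsAsFactors = FALSE)

# semi_join keeps only the token rows whose word appears in the list,
# without adding any columns from the second table
tokens <- semi_join(tokens, interesting, by = "word")
tokens
```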