cleanNLP 2.0: Quickstart Guide
25 January 2018
The cleanNLP package takes raw text and outputs a structured representation of the text annotated with auto-generated annotations. These annotations capture elements of the text such as detecting word boundaries, giving base form of words into a base form (i.e., ‘dog’ is the base of the word ‘dogs’), and identifying parts of speech. This document gives a quick guide to getting started with the recently released version (2.0.3) package. Amongst the changes in the new version is the inclusion of the udpipe backend, which allows for parsing text without the need to install Python or Java.
Step 1: Install package
First, install the latest version of the cleanNLP package:
You’ll need to have a version of cleanNLP >2.0 to follow along the rest of this blog post.
Step 2: Initialize the udpipe backend
We will be using the excellent udpipe backend for this quickstart because it requires no external dependencies, is quite fast, and can produce the majority of available annotation tasks. Let’s load the package and initialize the backend:
The first time you run this, R will download a 16Mb file. It will be stored automatically between R sessions.
Step 3: Format the input data
There are many formats for inputing data to cleanNLP package. We will use the simplest format that consists of simply taking a character vector containing one element per document. Here I will create a small input vector of three well-known quotes:
You can, of course, create your own input data in any number of formats. We are now ready to annotate text with cleanNLP.
Step 4: Annotate the text
Now we a ready to annotate the input text with the function
The output is an annotation object; there are many things you can do with the annotation object, but for most users a good starting place is to turn it into a single data frame:
Each row in the output corresponds to a single word in the original
id column tells us which document the word came
from and other columns give the annotated information about each word.
Step 5: Share and enjoy
The output object above is a data frame and can be stored as a csv file, manipulated, and plotted just as any other data frame can in R. Check out the next section for more details about how the annotation process outlined above can be used and customized for your own needs.
This guide is meant to show a simple way to get started with the cleanNLP package. Here are some ideas for moving from this guide to making the best of what the package has to offer:
- if you work with non-English language, look at the help pages
cnlp_init_udpipeto see how to initialize other natural languages
- for named entities and word vectors, look at using the spacy backend
in place of udpipe:
cnlp_init_spacy. It takes a bit more more to set up, however, as it requires Python and several Python modules.
- look into the structure of the raw annotation object; it consists of a list of normalized data frames that can be filtered and joined to produce sophisticated analyses of data
Future blog posts will highlight particular applications of the package to specific textual corpora.