Today we will cover the main aspects of working with raw
strings in R using the stringi package. To load
the package, call:
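Assuming stringi is already installed (for example from CRAN with install.packages("stringi")), loading it looks like this:

```r
# Attach stringi so that the stri_* functions are available by name.
library(stringi)
```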
The main advantages of this package compared to the string
functions in base R are:
- consistent syntax: the string you are operating on is always
  the first argument, and all functions start with stri_
- great support for non-Latin character sets and proper UTF-8
  handling
- in some cases, much faster than the alternatives
We will work with two datasets that come pre-installed with
stringr (a wrapper around stringi): a list of common
English words named words and a list of short sentences
named sentences. We will wrap these up as data frames in
order to make them usable by the dplyr verbs we have already covered.
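A minimal sketch of that wrapping step, assuming stringr and dplyr are installed. The names words_df and sentences_df and the column names word and sentence are illustrative choices, not taken from the original notes:

```r
library(stringi)
library(stringr)  # provides the words and sentences datasets
library(dplyr)

# One row per string, so dplyr verbs such as filter() and mutate() apply.
words_df <- tibble(word = stringr::words)
sentences_df <- tibble(sentence = stringr::sentences)

words_df
sentences_df
```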
The first function we will look at is stri_sub that takes
a substring of each input by position; for example the following
finds the first three characters of every string in the data
set of words:
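As a sketch of that call (the data frame words_df and the new column name start are assumed names for illustration):

```r
library(stringi)
library(dplyr)

words_df <- tibble(word = stringr::words)

# Characters 1 through 3 of every word; shorter words come back whole.
first_three <- mutate(words_df, start = stri_sub(word, 1, 3))
first_three
```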
Notice that R silently handles the fact that the first word
has only one letter (it is returned as-is).
We can use negative values to begin at the end of the string
(-1 is the last character, -2 the second to last and so on).
So the last two characters can be grabbed with this:
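A possible version of that call, again with words_df and the column name end as assumed names:

```r
library(stringi)
library(dplyr)

words_df <- tibble(word = stringr::words)

# From the second-to-last character (-2) through the last character (-1).
last_two <- mutate(words_df, end = stri_sub(word, -2, -1))
last_two

stri_sub("hello", -2, -1)  # "lo"
```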
Other simple stringi functions
The function stri_length returns how many characters are
in a string:
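For example, on a small hand-written vector (chosen here only for illustration):

```r
library(stringi)

# One length per input string.
word_lengths <- stri_length(c("a", "able", "about"))
word_lengths  # 1 4 5
```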
And the functions stri_trans_toupper and stri_trans_tolower do exactly
what their names suggest:
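A short illustration on made-up input:

```r
library(stringi)

x <- c("Hello World", "stringi")

upper <- stri_trans_toupper(x)  # "HELLO WORLD" "STRINGI"
lower <- stri_trans_tolower(x)  # "hello world" "stringi"
```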
We even have stri_trans_totitle to convert to title case:
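For instance, with an example sentence chosen only for illustration:

```r
library(stringi)

# The first letter of each word is capitalised; the rest are lower-cased.
title <- stri_trans_totitle("the quick brown fox")
title  # "The Quick Brown Fox"
```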
Matching fixed strings
A function that finds patterns is stri_detect, which
returns TRUE or FALSE for each element according to
whether it contains a given string. We can use this
in conjunction with the filter command to find
examples that contain a particular string:
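A sketch of this combination, assuming the sentences are stored in a data frame called sentences_df with a column sentence (both names are illustrative), and using "the" as an arbitrary search string:

```r
library(stringi)
library(dplyr)

sentences_df <- tibble(sentence = stringr::sentences)

# fixed = "the" asks for a literal, case-sensitive match.
with_the <- filter(sentences_df, stri_detect(sentence, fixed = "the"))
with_the
```

The search type is chosen by naming the pattern argument: fixed = for literal text, regex = for regular expressions.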
Similarly, stri_count tells us how often a sentence
uses a particular string. For instance, how many times
are the digraphs “th”, “ch”, and “sh” used in each sentence?
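One way to compute these counts is sketched below. The data frame and column names are assumptions, and the sentence column is shortened to 20 characters only to keep the printed output narrow:

```r
library(stringi)
library(dplyr)

sentences_df <- tibble(sentence = stringr::sentences)

digraphs <- sentences_df %>%
  mutate(
    th = stri_count(sentence, fixed = "th"),
    ch = stri_count(sentence, fixed = "ch"),
    sh = stri_count(sentence, fixed = "sh"),
    # Truncate last, so the counts above are based on the full sentence.
    sentence = stri_sub(sentence, 1, 20)
  )
digraphs
```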
I took a substring of the first column to make it fit on
The function stri_replace_all replaces one pattern with
another. Perhaps we want to replace all of those boring
“e”s with “ë”:
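A small sketch on a made-up string. The replacement is written as the escape "\u00eb" (the character ë) so the code does not depend on the file encoding:

```r
library(stringi)

# Arguments: input string, replacement, then the named pattern.
all_replaced <- stri_replace_all("eel tree", "\u00eb", fixed = "e")
all_replaced  # every "e" becomes "ë"
```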
The function stri_replace, without the “all”, only replaces
the first occurrence in each string. It is not usually as
useful as the _all variant, but it is named to be consistent
with other stringi functions.
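The difference can be seen on the same kind of made-up input:

```r
library(stringi)

# Only the first "e" in the string is replaced.
first_replaced <- stri_replace("eel tree", "\u00eb", fixed = "e")
first_replaced  # "ëel tree"
```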
Using the previous functions with a fixed string
can be useful, but the true strength of these functions
comes from their ability to accept a pattern known as a regular expression.
We don’t have time to cover these in great detail, but
will show a few important examples. For a more complete
description of regular expressions in R, see the pdf
The first example we will use is the “.” symbol, which matches any character.
So, for instance, the following finds every case where we have
the letters “w” and “s” separated by any single character.
Can you find where this occurs in each line?
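A sketch of such a search, with sentences_df and its column sentence as assumed names:

```r
library(stringi)
library(dplyr)

sentences_df <- tibble(sentence = stringr::sentences)

# "w.s" matches "w", then any one character, then "s" (for example "was").
w_any_s <- filter(sentences_df, stri_detect(sentence, regex = "w.s"))
w_any_s

stri_detect(c("was", "ws", "w s"), regex = "w.s")  # TRUE FALSE TRUE
```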
Two other special characters are “^” and “$”,
called anchors. The first matches the start of a
string and the second matches the end of a string. So, which words
end with the letter “w”?
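One way to answer this, assuming the words are stored in a data frame words_df with a column word (illustrative names):

```r
library(stringi)
library(dplyr)

words_df <- tibble(word = stringr::words)

# "w$" only matches a "w" that is the last character of the string.
ends_with_w <- filter(words_df, stri_detect(word, regex = "w$"))
ends_with_w
```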
Or start with “sh”?
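The same approach with the start anchor, again using the assumed words_df data frame:

```r
library(stringi)
library(dplyr)

words_df <- tibble(word = stringr::words)

# "^sh" only matches "sh" at the very beginning of the string.
starts_with_sh <- filter(words_df, stri_detect(word, regex = "^sh"))
starts_with_sh
```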
There is one other stringi function we did not mention
earlier: stri_extract. Given a pattern, it returns the
part of the string that matches it. This is not very useful without
regular expressions, but with them it is an invaluable tool.
For example, what characters follow the pattern “th”?
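A sketch of one way to look at this, with sentences_df and the new column name th_next as assumptions. By default stri_extract returns only the first match in each string, and NA when there is none:

```r
library(stringi)
library(dplyr)

sentences_df <- tibble(sentence = stringr::sentences)

# "th." captures "th" plus whichever single character comes next.
after_th <- mutate(sentences_df, th_next = stri_extract(sentence, regex = "th."))
after_th

stri_extract(c("the cat", "bath"), regex = "th.")  # "the" NA
```

The second example returns NA because nothing follows the “th” in “bath”.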
There are many other more complex regular expressions. For
example, this one is very useful:
If html is a string, this will replace each HTML tag
in it with a single space. We will use that in our lab
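The original pattern is not reproduced here, so the following is only one plausible regular expression that fits the description: it assumes a tag is a “<”, then one or more characters that are not “>”, then a “>”. The sample html string is made up for illustration:

```r
library(stringi)

html <- "<p>Hello <b>world</b>!</p>"

# Assumed pattern: "<", one or more non-">" characters, then ">".
no_tags <- stri_replace_all(html, " ", regex = "<[^>]+>")
no_tags  # " Hello  world ! "
```

This simple pattern does not handle every corner of HTML (for example a “>” inside an attribute value), so treat it as a quick cleaning step rather than a parser.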