# Class 18: Strings in R

## Basic string manipulation

Today we will cover the main aspects of working with raw strings in R using the stringi package. To load the package call:
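
A minimal sketch of the loading step, assuming the package has already been installed from CRAN (the commented line shows the usual one-time install):

```r
# install.packages("stringi")  # one-time install, only if the package is missing
library(stringi)
```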

The main advantages of this package over the string functions in base R are:

• consistent syntax - the string you are operating on is always the first argument and all function names start with stri_
• great support for non-latin character sets and proper UTF-8 handling
• in some cases much faster than alternatives

We will work with two datasets that come pre-installed with stringr (a wrapper around stringi): a list of common English tokens named words and a list of short sentences named sentences. We will wrap these up as data frames in order to make them usable by the dplyr verbs we have been learning:
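
One way to do the wrapping is sketched here. The object names `df_words` and `df_sentences` and the column names `word` and `sentence` are assumptions made for illustration, not names taken from the original notes:

```r
library(stringi)
library(stringr)  # used here only for the bundled `words` and `sentences` vectors

# Hypothetical object and column names; one row per word or sentence
df_words <- data.frame(word = stringr::words, stringsAsFactors = FALSE)
df_sentences <- data.frame(sentence = stringr::sentences, stringsAsFactors = FALSE)

head(df_words)
head(df_sentences)
```

Plain data frames like these can be passed directly to dplyr verbs such as filter and mutate.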

### stri_sub

The first function we will look at is stri_sub, which takes a substring of each input by position. For example, the following finds the first three characters of every string in the data set of words:
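
A small sketch of that call. To keep it self-contained it uses a short stand-in vector (`words_sample`, an assumed name) rather than the full list of words:

```r
library(stringi)

# A short stand-in for the words data (assumed sample, not the full list)
words_sample <- c("a", "able", "about", "absolute")

# Characters 1 through 3 of each string
first_three <- stri_sub(words_sample, from = 1, to = 3)
first_three
# "a" "abl" "abo" "abs"
```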

Notice that R silently handles the fact that the first word has only one letter (it is returned as-is).

We can use negative values to begin at the end of the string (-1 is the last character, -2 the second to last and so on). So the last two characters can be grabbed with this:
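
As a sketch, again on a short stand-in vector (`words_sample` is an assumed name and sample):

```r
library(stringi)

# A short stand-in for the words data (assumed sample)
words_sample <- c("able", "about", "absolute")

# From the second-to-last character (-2) through the last character (-1)
last_two <- stri_sub(words_sample, from = -2, to = -1)
last_two
# "le" "ut" "te"
```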

### Other simple stringi functions

The function stri_length returns the number of characters in each string:
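
A quick illustration on an assumed sample vector:

```r
library(stringi)

# A short stand-in for the words data (assumed sample)
words_sample <- c("a", "able", "about", "absolute")

# One integer per input string
n_chars <- stri_length(words_sample)
n_chars
# 1 4 5 8
```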

And the functions stri_trans_toupper and stri_trans_tolower do exactly what their names suggest:
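
For instance, on a single assumed example string:

```r
library(stringi)

# An assumed example string
x <- "The birch canoe"

upper <- stri_trans_toupper(x)
lower <- stri_trans_tolower(x)

upper
# "THE BIRCH CANOE"
lower
# "the birch canoe"
```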

We even have stri_trans_totitle to convert to title case:
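
A short sketch on an assumed example string:

```r
library(stringi)

# An assumed example string
x <- "the birch canoe slid"

# Capitalize the first letter of every word
titled <- stri_trans_totitle(x)
titled
# "The Birch Canoe Slid"
```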

## Matching fixed strings

### stri_detect

A function that finds patterns is stri_detect, which returns TRUE or FALSE depending on whether an element contains a given string. We can use this in conjunction with the filter command to find examples with a particular string in them:
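
A sketch of that combination. The three sentences and the names `df_sentences` and `sentence` are assumptions chosen to keep the example self-contained; the search string is passed through the `fixed` argument so that it is matched literally:

```r
library(stringi)
library(dplyr)

# A small stand-in for the sentences data (assumed sample and names)
df_sentences <- data.frame(
  sentence = c("The birch canoe slid on the smooth planks.",
               "Glue the sheet to the dark blue background.",
               "Rice is often served in round bowls."),
  stringsAsFactors = FALSE
)

# Keep only the rows whose sentence contains the literal string "the"
with_the <- filter(df_sentences, stri_detect(sentence, fixed = "the"))
with_the
```

Matching is case-sensitive, so a capitalized "The" at the start of a sentence does not count on its own.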

### stri_count

Similarly stri_count tells us how often a sentence uses a particular string. For instance, how many times are the digraphs “th”, “ch”, and “sh” used in each sentence:
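
One possible version of that computation, on an assumed three-sentence sample. The column names `th`, `ch`, and `sh` are illustrative; the last line of the mutate call shortens the sentence column with stri_sub so the table prints compactly:

```r
library(stringi)
library(dplyr)

# A small stand-in for the sentences data (assumed sample and names)
df_sentences <- data.frame(
  sentence = c("The birch canoe slid on the smooth planks.",
               "Glue the sheet to the dark blue background.",
               "Rice is often served in round bowls."),
  stringsAsFactors = FALSE
)

# Count each digraph per sentence, then truncate the text column for display.
# mutate evaluates its arguments in order, so the counts use the full sentence.
df_counts <- mutate(
  df_sentences,
  th = stri_count(sentence, fixed = "th"),
  ch = stri_count(sentence, fixed = "ch"),
  sh = stri_count(sentence, fixed = "sh"),
  sentence = stri_sub(sentence, 1, 20)
)
df_counts
```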

I took a substring of the first column to make it fit on the page.

### stri_replace_all

The function stri_replace_all replaces one pattern with another. Perhaps we want to replace all of those boring “e”s with “ë”:
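
A sketch on an assumed example string. Note the argument order: the replacement comes second, and the text to search for is passed by name through `fixed`:

```r
library(stringi)

# An assumed example string
x <- "The birch canoe slid on the smooth planks."

# Replace every literal "e" with "ë"
fancy <- stri_replace_all(x, "ë", fixed = "e")
fancy
```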

The function stri_replace without the “all” only replaces the first occurrence in each string. It is not usually as useful as the _all variant, but is named to be consistent with other stringi functions.
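
To see the difference, here is a sketch on an assumed example string, relying on the default behavior of stri_replace (first match only):

```r
library(stringi)

# An assumed example string containing two "e" characters
x <- "The birch canoe"

# Only the first "e" is replaced; the second is left alone
first_only <- stri_replace(x, "ë", fixed = "e")
first_only
```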

## Matching patterns

### Patterns

Using the previous functions with a fixed string can be useful, but the true strength of these functions comes from their ability to accept a pattern known as a regular expression.

We don’t have time to cover these in great detail, but will show a few important examples. For a more complete description of regular expressions in R, see the pdf file here:

The first example we will use is the “.” symbol which matches any character.

So, for instance the following finds any time that we have the letters “w” and “s” separated by any third character. Can you find where this occurs in each line?
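
A sketch of such a search on three assumed example sentences. The pattern is passed through the `regex` argument instead of `fixed`, so the "." is treated as a wildcard:

```r
library(stringi)

# Assumed example sentences
x <- c("He was late.", "A few stars were out.", "No match here.")

# "w", then any single character, then "s"
has_w_s <- stri_detect(x, regex = "w.s")
has_w_s
# TRUE TRUE FALSE
```

In the second sentence the match spans two words ("few stars"), because a space also counts as "any character".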

### Anchors

Two other special characters are “^” and “\$”, called anchors. The first matches the start of a string and the second matches the end of a string. So, which words end with the letter “w”?
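
A sketch of both anchors on a short stand-in vector (`words_sample` is an assumed name and sample):

```r
library(stringi)

# A short stand-in for the words data (assumed sample)
words_sample <- c("allow", "window", "water", "new")

# "w$" matches a "w" at the very end of the string
ends_with_w <- words_sample[stri_detect(words_sample, regex = "w$")]
ends_with_w
# "allow" "window" "new"

# "^w" matches a "w" at the very start of the string
starts_with_w <- words_sample[stri_detect(words_sample, regex = "^w")]
starts_with_w
# "window" "water"
```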

There is one other stringi function we did not mention earlier: stri_extract. Given a pattern, it returns the part of the string that matches it. This is not very useful without regular expressions, but with them it is an invaluable tool.
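
As a sketch, here is stri_extract applied to three assumed example sentences with the pattern "w.s". By default it returns the first match in each string, or NA when nothing matches:

```r
library(stringi)

# Assumed example sentences
x <- c("He was late.", "A few stars were out.", "No match here.")

# The matched text itself, rather than TRUE/FALSE
matched <- stri_extract(x, regex = "w.s")
matched
# "was" "w s" NA
```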
For example, if html is a string containing HTML markup, a regular expression that matches tags lets us replace each tag with a single space. We will use that in our lab today.
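
One way this tag replacement might look is sketched below. Both the example string and the pattern `<[^>]+>` are assumptions, not taken from the lab: the pattern reads as "a `<`, then one or more characters that are not `>`, then a `>`", which uses a character class not covered above.

```r
library(stringi)

# An assumed example of a string holding HTML markup
html <- "<p>Hello <b>world</b></p>"

# Replace every tag (everything from "<" to the next ">") with a single space
text_only <- stri_replace_all(html, " ", regex = "<[^>]+>")
text_only
```

Each of the four tags becomes one space, so the result keeps the visible text but may contain runs of extra spaces.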