Tutorial 05: Regular Expressions [SOLUTIONS]

Here we introduce the concept of a regular expression and the Python module re, which provides an efficient and easy-to-use implementation of regular expressions.

Matching fixed strings

Regular expressions are used to identify patterns within string objects. Once found, these patterns can be used for tasks such as extraction, substitution, or splitting the string into parts.

We will introduce the basic concepts through substitution and show the other tasks at the end of the notebook.

To start, import the re module:

In [1]:
import re

We will use the function re.sub to replace all instances of a substring with another. Here, we will replace all spaces with dashes:

In [2]:
re.sub(" ", "-", "I am having fun with regular expressions! They are great!")
Out[2]:
'I-am-having-fun-with-regular-expressions!-They-are-great!'
In [3]:
re.sub("fun", "FUN", "I am having fun with regular expressions! They are great!")
Out[3]:
'I am having FUN with regular expressions! They are great!'

As we see, the first argument defines the pattern, the second the replacement, and the third the string to operate on. Applied in sequence, substitutions can clean up character data:

In [4]:
msg = "I am having fun with regular expressions! They are great!"
msg = re.sub(" ", "-", msg)
msg = re.sub("!", "", msg)
msg
Out[4]:
'I-am-having-fun-with-regular-expressions-They-are-great'

Matching patterns

The power of regular expressions comes from the ability to match not just fixed strings but patterns of strings. There is a whole language of regular expressions; here I will show just a few of the most common examples.

The symbol + matches one or more repetitions of the preceding character. Take this example:

In [5]:
msg = "aardvark?"
re.sub("a", "A", msg)
Out[5]:
'AArdvArk?'

And compare it to:

In [6]:
msg = "aardvark?"
re.sub("a+", "A", msg)
Out[6]:
'ArdvArk?'

The expression a+ matches both the single letter "a" and the pair "aa" (by default, regular expressions are greedy and match the longest possible string).
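
To see the greedy behavior directly, here is a small illustrative sketch (not part of the original examples) on a longer run of the letter "a":

re.sub("a+", "A", "baaaad")   # the whole run 'aaaa' is consumed at once, giving 'bAd'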

We can match any one of a set of characters using square brackets, []. So to match any sequence of digits we can use this:

In [7]:
msg = "1000x 2341y 1104z"
re.sub("[0123456789]+", "NUMBER", msg)
Out[7]:
'NUMBERx NUMBERy NUMBERz'

This reads: "replace any sequence of digits with the string 'NUMBER'". There is a shortcut for this using the notation [0-9]. Similarly, [a-z] matches lowercase letters and [A-Z] matches uppercase letters.
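
As a quick check, the shortcut notation gives the same result as writing out all ten digits; the second line is an extra illustration using the uppercase range:

re.sub("[0-9]+", "NUMBER", "1000x 2341y 1104z")   # -> 'NUMBERx NUMBERy NUMBERz'
re.sub("[A-Z]", "*", "Data Science")              # -> '*ata *cience'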

Finally, the symbol ^ at the start of a bracketed set negates it. So [^a-z]+ matches any sequence of characters that are not lowercase letters:

In [8]:
re.sub("[^a-z]+", "", "I am having fun with regular expressions! They are great!")
Out[8]:
'amhavingfunwithregularexpressionsheyaregreat'

You may find that you want to match a character that has a special meaning, such as the literal caret symbol: ^. To do this, simply precede the character with \\ to escape it (the backslash is doubled because Python itself treats a single backslash in a string as an escape character).

In [9]:
re.sub("\\^", "**", "2^3")
Out[9]:
'2**3'
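
An equivalent and often cleaner way to write this in Python is with a raw string, prefixed by r, in which backslashes are passed through to the regular expression engine untouched. The later solutions in this notebook use this form whenever a pattern contains a backslash:

re.sub(r"\^", "**", "2^3")   # raw-string version; same result, '2**3'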

Application: HTML

A very common application of regular expressions is to match HTML tags, which are contained between < and >. For example, <a href="python.org">. To match an HTML tag, use this expression:

In [10]:
re.sub("<[^>]+>", "", "<a href='www.python.org'>click here!</a>")
Out[10]:
'click here!'

Can you figure out exactly how this expression works?
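
As a hint, the pattern reads: a literal <, then one or more characters that are anything except >, then the closing >. A minimal sketch on a fresh string:

re.sub("<[^>]+>", "", "x <b>y</b>")   # -> 'x y'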

Find and split

As mentioned, there are other tasks we can do once a substring has been identified. We could, for example, split a string apart wherever the pattern matches using re.split:

In [11]:
re.split(" ", "I am having fun with regular expressions! They are great!")
Out[11]:
['I',
 'am',
 'having',
 'fun',
 'with',
 'regular',
 'expressions!',
 'They',
 'are',
 'great!']
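
Because the first argument is a full pattern rather than a fixed string, we can also split on runs of spaces or exclamation marks at once. A small sketch (note the trailing empty string produced by the final '!'):

re.split("[ !]+", "I am having fun with regular expressions! They are great!")
# -> ['I', 'am', 'having', 'fun', 'with', 'regular', 'expressions', 'They', 'are', 'great', '']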

Or, extract just the matching substrings using re.findall:

In [12]:
re.findall("<[^>]+>", "<a href='www.python.org'>click here!</a>")
Out[12]:
["<a href='www.python.org'>", '</a>']

Both of these functions return list objects, which we will see in the next notebook.


Practice

Basic application

For the next few practice exercises, we will use the string defined below, which comes from the Wikipedia page on "Data Science":

In [13]:
wiki = """Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights from data in
various forms, both structured and unstructured, similar to data mining.
Data science is a "concept to unify statistics, data analysis, machine learning
and their related methods" in order to "understand and analyze actual phenomena"
with data. It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, information science, and computer science."""

In the code block below, write a regular expression that replaces all spaces with underscores in the string wiki. Do not save the result; just print it out using the print function.

In [14]:
print(re.sub(" ", "_", wiki))
Data_science_is_an_interdisciplinary_field_that_uses_scientific_methods,
processes,_algorithms_and_systems_to_extract_knowledge_and_insights_from_data_in
various_forms,_both_structured_and_unstructured,_similar_to_data_mining.
Data_science_is_a_"concept_to_unify_statistics,_data_analysis,_machine_learning
and_their_related_methods"_in_order_to_"understand_and_analyze_actual_phenomena"
with_data._It_employs_techniques_and_theories_drawn_from_many_fields_within_the
context_of_mathematics,_statistics,_information_science,_and_computer_science.

Now, write code to remove all vowels from the string.

In [15]:
print(re.sub("[aeiou]", "", wiki))
Dt scnc s n ntrdscplnry fld tht ss scntfc mthds,
prcsss, lgrthms nd systms t xtrct knwldg nd nsghts frm dt n
vrs frms, bth strctrd nd nstrctrd, smlr t dt mnng.
Dt scnc s  "cncpt t nfy sttstcs, dt nlyss, mchn lrnng
nd thr rltd mthds" n rdr t "ndrstnd nd nlyz ctl phnmn"
wth dt. It mplys tchnqs nd thrs drwn frm mny flds wthn th
cntxt f mthmtcs, sttstcs, nfrmtn scnc, nd cmptr scnc.

Detecting words

Next, we want to find all words in the text. Write code to detect all words in the string wiki. I recommend again using print to show the results in an easy-to-view format.

In [16]:
print(re.findall("[a-zA-Z]+", wiki))
['Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'data', 'in', 'various', 'forms', 'both', 'structured', 'and', 'unstructured', 'similar', 'to', 'data', 'mining', 'Data', 'science', 'is', 'a', 'concept', 'to', 'unify', 'statistics', 'data', 'analysis', 'machine', 'learning', 'and', 'their', 'related', 'methods', 'in', 'order', 'to', 'understand', 'and', 'analyze', 'actual', 'phenomena', 'with', 'data', 'It', 'employs', 'techniques', 'and', 'theories', 'drawn', 'from', 'many', 'fields', 'within', 'the', 'context', 'of', 'mathematics', 'statistics', 'information', 'science', 'and', 'computer', 'science']

If you look a bit farther down the Wikipedia page, you'll see the following sentence.

In [17]:
wiki2 = """Even the suggestion that data science is sexy was a paraphrased
reference to Dr. Hans Rosling's 2011 BBC documentary quote, "Statistics, is
now the sexiest subject around"."""

The code you wrote above probably will not work on this string because it will miss the date '2011' as well as the apostrophe. Modify the code to catch these edge cases and print the results.

In [18]:
print(re.findall("[a-zA-Z0-9']+", wiki2))
['Even', 'the', 'suggestion', 'that', 'data', 'science', 'is', 'sexy', 'was', 'a', 'paraphrased', 'reference', 'to', 'Dr', 'Hans', "Rosling's", '2011', 'BBC', 'documentary', 'quote', 'Statistics', 'is', 'now', 'the', 'sexiest', 'subject', 'around']

There is actually a better way of finding words in a string using the special marker \w, which matches any word-like character. It will not, though, detect apostrophes (because these may or may not be part of a word). Modify the code you had above using the \w marker:

In [19]:
print(re.findall("[\w']+", wiki2))
['Even', 'the', 'suggestion', 'that', 'data', 'science', 'is', 'sexy', 'was', 'a', 'paraphrased', 'reference', 'to', 'Dr', 'Hans', "Rosling's", '2011', 'BBC', 'documentary', 'quote', 'Statistics', 'is', 'now', 'the', 'sexiest', 'subject', 'around']

One benefit of the \w character is that it can handle other languages. Take the French page on Science des données:

In [20]:
wiki_fr = """En termes généraux, la science des données est l'extraction
de connaissance d'ensembles de données"""

Try both your original code and the code with \w on wiki_fr:

In [21]:
print(re.findall("[a-zA-Z0-9']+", wiki_fr))
print(re.findall("[\w']+", wiki_fr))
['En', 'termes', 'g', 'n', 'raux', 'la', 'science', 'des', 'donn', 'es', 'est', "l'extraction", 'de', 'connaissance', "d'ensembles", 'de', 'donn', 'es']
['En', 'termes', 'généraux', 'la', 'science', 'des', 'données', 'est', "l'extraction", 'de', 'connaissance', "d'ensembles", 'de', 'données']

Which of them successfully splits the string as you would expect?

Answer: Only the version using \w splits the string as expected. The range [a-zA-Z] does not include accented letters such as é, so words like "généraux" and "données" are broken apart.

Finally, take the start of the Chinese-language page on "数据科学".

In [22]:
wiki_zh = """"数据科学,又称资料科学,是一门利用数据学习知识的学科,
其目标是通过从数据中提取出有价值的部分来生产数据产品。"""

Try both the original code and the one with \w on the string.

In [23]:
print(re.findall("[a-zA-Z0-9']+", wiki_zh))
print(re.findall("[\w']+", wiki_zh))
[]
['数据科学', '又称资料科学', '是一门利用数据学习知识的学科', '其目标是通过从数据中提取出有价值的部分来生产数据产品']

What happens? Does the \w marker correctly handle the Chinese full stop at the end of the string? Does this code run the same on all computers in the class?

Answer: The \w marker does not match the Chinese full stop 。 because it is punctuation rather than a word character, so the sentence-ending symbol is correctly excluded. Since Python 3 strings are Unicode, this should run the same on all machines in the class. Note that the regular expression works in the sense that it correctly detects Chinese characters as word characters; it does not, however, split the string into individual words, because Chinese does not separate words with spaces and this cannot be done with regular expressions alone.

Trimming output

Consider a similar string containing the raw HTML code for the Wikipedia page:

In [24]:
wiki_html = """ <p>The term "data science" has appeared in various contexts over 
the past thirty years but did not become an established term until recently.
In an early usage it was used as  a substitute for
<a href="/wiki/Computer_science">computer science</a>
by <a href="/wiki/Peter_Naur">Peter Naur</a>  in 1960. """

I have already shown how to remove all of the HTML tags. Below, remove the HTML tags, save the result back into wiki_html, and print it out. If you make a mistake, just re-run the cell above to re-create the original string.

In [25]:
wiki_html = re.sub("<[^>]+>", "", wiki_html)
print(wiki_html)
 The term "data science" has appeared in various contexts over 
the past thirty years but did not become an established term until recently.
In an early usage it was used as  a substitute for
computer science
by Peter Naur  in 1960. 

There are also newline characters \n (or \r\n on Windows) in the string. Remove these as well and save the result as wiki_html again (note: be careful here not to run any words together).

In [26]:
wiki_html = re.sub("[\n\r]", " ", wiki_html)
print(wiki_html)
 The term "data science" has appeared in various contexts over  the past thirty years but did not become an established term until recently. In an early usage it was used as  a substitute for computer science by Peter Naur  in 1960. 

You should notice that there is an errant extra space at both the start and the end of the string, as well as several doubled spaces in the middle. These issues commonly occur when cleaning text with regular expressions. We could use the string method strip to handle some of them, but let's try something a bit more general.

There are several regular expression markers called anchors that match positions rather than characters. For example, \A matches the start of the string and \Z the end. Using these, create a new version of wiki_html that removes the leading and trailing spaces (hint: this is easiest with two distinct regular expressions).

In [27]:
wiki_html = re.sub("\A ", "", wiki_html)
wiki_html
Out[27]:
'The term "data science" has appeared in various contexts over  the past thirty years but did not become an established term until recently. In an early usage it was used as  a substitute for computer science by Peter Naur  in 1960. '
In [28]:
wiki_html = re.sub(" \Z", "", wiki_html)
wiki_html
Out[28]:
'The term "data science" has appeared in various contexts over  the past thirty years but did not become an established term until recently. In an early usage it was used as  a substitute for computer science by Peter Naur  in 1960.'

Now with the start and stop cases handled, replace any sequence of spaces with just a single space.

In [29]:
wiki_html = re.sub("[ ]+", " ", wiki_html)
wiki_html
Out[29]:
'The term "data science" has appeared in various contexts over the past thirty years but did not become an established term until recently. In an early usage it was used as a substitute for computer science by Peter Naur in 1960.'

The result should now look reasonably close to the format of the text we started with in the prior section.
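
As an aside, the whole whitespace normalization can also be sketched in one line with the built-in split and join methods, which split on any run of whitespace and drop leading and trailing runs automatically:

" ".join("  extra   spaces  here ".split())   # -> 'extra spaces here'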

HTML tags

Finally, let's try a bit more of a challenge. Take the following string of HTML code once again:

In [30]:
wiki_html = """ <p>The term "data science" has appeared in various contexts over 
the past thirty years but did not become an established term until recently.
In an early usage it was used as  a substitute for
<a href="/wiki/Computer_science">computer science</a>
by <a href="/wiki/Peter_Naur">Peter Naur</a>  in 1960. """

I want to extract only the strings related to the link tags <a>. To do this, you'll need one more piece of regular expression notation: capture groups, written with parentheses.

What if we want to match a regular expression but return only part of the matched string? For example, say we want to find what words follow the word "in". We could find pairs of words like this:

In [31]:
re.findall("in \w+", wiki_html)
Out[31]:
['in various', 'in 1960']

But what if we just want the second word? To do that, put the part of the regular expression you want returned in parentheses, like this:

In [32]:
re.findall("in (\w+)", wiki_html)
Out[32]:
['various', '1960']

Now, using this notation, you can extract the links contained in the text wiki_html. That is, get the strings "/wiki/Computer_science" and "/wiki/Peter_Naur". In the space below, find them with a single regular expression:

In [33]:
re.findall(r'<a href="(/wiki/\w+)', wiki_html)
Out[33]:
['/wiki/Computer_science', '/wiki/Peter_Naur']
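
The pattern above assumes every link starts with /wiki/ and contains only word characters. A slightly more general sketch captures everything up to the closing quote instead:

re.findall(r'<a href="([^"]+)"', wiki_html)   # -> ['/wiki/Computer_science', '/wiki/Peter_Naur']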

Similarly, can you get the strings contained inside the tags? That is, find the strings "computer science" and "Peter Naur". Write the code for this below:

In [34]:
re.findall(r'<a href="/wiki/\w+">([^<]+)</a>', wiki_html)
Out[34]:
['computer science', 'Peter Naur']
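
As a final sketch, two capture groups in a single expression return each link together with its text; when a pattern contains multiple groups, re.findall returns a list of tuples:

re.findall(r'<a href="([^"]+)">([^<]+)</a>', wiki_html)
# -> [('/wiki/Computer_science', 'computer science'), ('/wiki/Peter_Naur', 'Peter Naur')]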