Tutorial 06: Using Requests and Regular Expressions to Count Words

Here we use the requests library to grab web data from Wikipedia as an HTML page. Using the regular expressions you saw in the prior tutorial, you'll remove all of the special formatting and count the most frequent words found on the page. This will be our first chance to do something with actual data from Wikipedia.

Modules

To start, we will load both the re module and the requests module. The latter is what we will use to download web pages into Python.

In [1]:
import re
import requests

Making a request

We will all start by grabbing the Wikipedia webpage associated with the University of Richmond. At the end of the tutorial, you'll be able to grab a website of your own choosing. I suggest opening the Wikipedia page in another tab so that you can compare the website with the extracted code in Python.

To make a "request" using the requests module, we use the function get and pass it the full URL to the page, like this:

In [2]:
url = 'https://en.wikipedia.org/wiki/University_of_Richmond'
r = requests.get(url)
r
Out[2]:
<Response [200]>

You'll notice that the object that is returned, called r here, does not print out anything resembling the actual website. Instead, we just get a message that should say <Response [200]> (if not, you have a problem, perhaps a network connectivity issue). This means the request was processed and returned the HTTP status code 200, which indicates a status of OK; more verbosely:

Standard response for successful HTTP requests. The actual response will depend on the request method used. In a GET request, the response will contain an entity corresponding to the requested resource. In a POST request, the response will contain an entity describing or containing the result of the action.
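When you write your own scripts, you may want to verify the status code before going any further. You can compare r.status_code to 200 directly, or call r.raise_for_status(), which raises an exception for any 4xx/5xx code. Here is a small sketch of both checks; it builds a bare Response object by hand purely so the example runs without a network connection (a real script would use the response from requests.get):

```python
import requests

# A real script would use the response returned by requests.get(url); we
# construct an empty Response here just to demonstrate the API offline.
resp = requests.Response()

resp.status_code = 200
resp.raise_for_status()          # 200 is OK: no exception is raised
print(resp.ok)                   # True -- shorthand for "status code below 400"

resp.status_code = 404
print(resp.ok)                   # False
try:
    resp.raise_for_status()      # 4xx/5xx codes raise requests.HTTPError
except requests.HTTPError as e:
    print('request failed:', e)
```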

The response given by the website contains a number of elements. For example, there is the "HTTP header" that contains metadata about the HTTP request:

In [3]:
r.headers
Out[3]:
{'Date': 'Wed, 26 Sep 2018 13:30:16 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '36303', 'Connection': 'keep-alive', 'Server': 'mw2186.codfw.wmnet', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'X-Content-Type-Options': 'nosniff', 'P3P': 'CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'X-Powered-By': 'HHVM/3.18.6-dev', 'Content-language': 'en', 'Last-Modified': 'Tue, 18 Sep 2018 00:34:20 GMT', 'Backend-Timing': 'D=87588 t=1537674113436458', 'Content-Encoding': 'gzip', 'X-Varnish': '144837023 662851547, 952575158 960807635', 'Via': '1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)', 'Age': '112900', 'X-Cache': 'cp2016 hit/7, cp2010 hit/21', 'X-Cache-Status': 'hit-front', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Set-Cookie': 'WMF-Last-Access=26-Sep-2018;Path=/;HttpOnly;secure;Expires=Sun, 28 Oct 2018 12:00:00 GMT, WMF-Last-Access-Global=26-Sep-2018;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Sun, 28 Oct 2018 12:00:00 GMT, GeoIP=US:VA:Richmond:37.55:-77.46:v4; Path=/; secure; Domain=.wikipedia.org', 'X-Analytics': 'ns=0;page_id=856511;https=1;nocookies=1', 'X-Client-IP': '108.4.68.106', 'Cache-Control': 'private, s-maxage=0, max-age=0, must-revalidate', 'Accept-Ranges': 'bytes'}

The part that we are most interested in, however, is the text of the response. This is given by the attribute text, which we can print out as follows:

In [5]:
print(r.text[:1000])
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>University of Richmond - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"University_of_Richmond","wgTitle":"University of Richmond","wgCurRevisionId":860024102,"wgRevisionId":860024102,"wgArticleId":856511,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","All articles with dead external links","Articles with dead external links from August 2018","Articles with permanently dead external links","Instances of Infobox university using image size","Coordinates not on Wikidata","Educational institutions established in 183

The text above is written in a markup language called HTML; rendered in your browser it yields the pretty website that you are used to seeing when you navigate to Wikipedia.

In Python, this text is just stored as a very long string object similar to the strings you saw in Tutorial 3. We will now make use of string methods and regular expression functions to process the string and extract the individual words.

Cleaning HTML code

To start, we will save the request text as a variable called website. Just to simplify the processing, in the code below I have also removed the first set of lines corresponding to the HTML header, along with an embedded JavaScript chunk at the bottom of the text.

In [6]:
website = r.text
website = website[website.find("<body"):website.find("<noscript>")]
print(website[:1000])
<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-University_of_Richmond rootpage-University_of_Richmond skin-vector action-view">		<div id="mw-page-base" class="noprint"></div>
		<div id="mw-head-base" class="noprint"></div>
		<div id="content" class="mw-body" role="main">
			<a id="top"></a>
			<div id="siteNotice" class="mw-body-content"><!-- CentralNotice --></div><div class="mw-indicators mw-body-content">
</div>
<h1 id="firstHeading" class="firstHeading" lang="en">University of Richmond</h1>			<div id="bodyContent" class="mw-body-content">
				<div id="siteSub" class="noprint">From Wikipedia, the free encyclopedia</div>				<div id="contentSub"></div>
				<div id="jump-to-nav"></div>				<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
				<a class="mw-jump-link" href="#p-search">Jump to search</a>
				<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><div class="mw-parser-output"><table class="infobox vcard" style="width

Make sure that you scroll through some of the text above; the top is a bit noisy, but you should see the text of the page hidden within the HTML tags.
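The slicing used to trim the page relies on the string method find, which returns the index of the first occurrence of its argument (or -1 when the argument is absent). A toy example of the same pattern (the string s here is made up for illustration):

```python
# find locates the start of a substring; slicing between two such
# positions keeps only the text from the first marker up to the second
s = 'abc<body>hello<noscript>bye'
start = s.find('<body')       # index 3
stop = s.find('<noscript>')   # index 14
print(s[start:stop])          # prints: <body>hello
```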

Now, it's your turn to start writing some code. In the block below, overwrite the variable website by removing all HTML tags from the original string. Print out the website variable at the end of the block of code.

In [7]:
website = re.sub('<[^>]+>', '', website)
print(website[:1000])
		
		
		
			
			

University of Richmond			
				From Wikipedia, the free encyclopedia				
								Jump to navigation
				Jump to search
				University of Richmond
Motto
Verbum Vitae et Lumen Scientiae (Latin)Motto in&#160;English
Word of life and the light of knowledge &#91;1&#93;Type
PrivateEstablished
1830&#160;(1830)Endowment
$2.37 billion (2017)&#91;2&#93;President
Ronald CrutcherAcademic staff
612 (402 full-time, 210 part-time)&#91;3&#93;Students
4,131&#91;4&#93;Undergraduates
3,254 (3,052 full-time, 202 part-time)&#91;4&#93;Postgraduates
877 (500 full-time, 377 part-time)&#91;4&#93;Location
Richmond, Virginia, U.S.Campus
Suburban, 350 acres (1.4&#160;km2)Colors
UR Blue and UR Red&#91;5&#93;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;Athletics
NCAA Division I – A-10Nickname
SpidersAffiliations
ACSSURACICMascot
WebstUR the Spider&#91;6&#93;Website
www.richmond.edu

The University of Richmond (UR or U of R) is a private, nonsectarian, liberal arts college located in the ci

This should already look a lot closer to the raw text on the website. Recall that the function print renders newline characters nicely. It does the same for a TAB (represented by the escape sequence \t). To see this, look at the raw string website by running the code below; mentally compare it to the printed version above.

In [8]:
website[:1000]
Out[8]:
'\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\n\nUniversity of Richmond\t\t\t\n\t\t\t\tFrom Wikipedia, the free encyclopedia\t\t\t\t\n\t\t\t\t\t\t\t\tJump to navigation\n\t\t\t\tJump to search\n\t\t\t\tUniversity of Richmond\nMotto\nVerbum Vitae et Lumen Scientiae (Latin)Motto in&#160;English\nWord of life and the light of knowledge &#91;1&#93;Type\nPrivateEstablished\n1830&#160;(1830)Endowment\n$2.37 billion (2017)&#91;2&#93;President\nRonald CrutcherAcademic staff\n612 (402 full-time, 210 part-time)&#91;3&#93;Students\n4,131&#91;4&#93;Undergraduates\n3,254 (3,052 full-time, 202 part-time)&#91;4&#93;Postgraduates\n877 (500 full-time, 377 part-time)&#91;4&#93;Location\nRichmond, Virginia, U.S.Campus\nSuburban, 350 acres (1.4&#160;km2)Colors\nUR Blue and UR Red&#91;5&#93;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;Athletics\nNCAA Division I – A-10Nickname\nSpidersAffiliations\nACSSURACICMascot\nWebstUR the Spider&#91;6&#93;Website\nwww.richmond.edu\n\nThe University of Richmond (UR or U of R) is a private, nonsectarian, liberal arts college located in the ci'
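The same contrast can be seen on a toy string (a quick aside, not part of the tutorial's pipeline): repr shows a string the way a bare variable does in a notebook, with the escape codes written out, while print renders them as real whitespace.

```python
s = "one\ttwo\nthree"

# print() renders the escape sequences as actual whitespace
print(s)

# repr() shows the escape codes literally, like the raw Out[] display above
print(repr(s))
```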

Let's now replace all copies of the special characters \n, \r, and \t with a single space in the variable website. Make sure to save the result again as the variable website. (Note: You could do this with three separate calls to re.sub, but try to do it in just a single line.)

In [9]:
website = re.sub('[\n\r\t]', ' ', website)
print(website[:1000])
                  University of Richmond        From Wikipedia, the free encyclopedia             Jump to navigation     Jump to search     University of Richmond Motto Verbum Vitae et Lumen Scientiae (Latin)Motto in&#160;English Word of life and the light of knowledge &#91;1&#93;Type PrivateEstablished 1830&#160;(1830)Endowment $2.37 billion (2017)&#91;2&#93;President Ronald CrutcherAcademic staff 612 (402 full-time, 210 part-time)&#91;3&#93;Students 4,131&#91;4&#93;Undergraduates 3,254 (3,052 full-time, 202 part-time)&#91;4&#93;Postgraduates 877 (500 full-time, 377 part-time)&#91;4&#93;Location Richmond, Virginia, U.S.Campus Suburban, 350 acres (1.4&#160;km2)Colors UR Blue and UR Red&#91;5&#93;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;Athletics NCAA Division I – A-10Nickname SpidersAffiliations ACSSURACICMascot WebstUR the Spider&#91;6&#93;Website www.richmond.edu  The University of Richmond (UR or U of R) is a private, nonsectarian, liberal arts college located in the ci

Now we are really getting close to the raw text on the page!

As a next step, use the string method .lower() to make the website all in lower case. This will help later so that words like "School" and "school" are not counted differently. Print out the result again just to check your code.

In [10]:
website = website.lower()
print(website[:1000])
                  university of richmond        from wikipedia, the free encyclopedia             jump to navigation     jump to search     university of richmond motto verbum vitae et lumen scientiae (latin)motto in&#160;english word of life and the light of knowledge &#91;1&#93;type privateestablished 1830&#160;(1830)endowment $2.37 billion (2017)&#91;2&#93;president ronald crutcheracademic staff 612 (402 full-time, 210 part-time)&#91;3&#93;students 4,131&#91;4&#93;undergraduates 3,254 (3,052 full-time, 202 part-time)&#91;4&#93;postgraduates 877 (500 full-time, 377 part-time)&#91;4&#93;location richmond, virginia, u.s.campus suburban, 350 acres (1.4&#160;km2)colors ur blue and ur red&#91;5&#93;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;athletics ncaa division i – a-10nickname spidersaffiliations acssuracicmascot webstur the spider&#91;6&#93;website www.richmond.edu  the university of richmond (ur or u of r) is a private, nonsectarian, liberal arts college located in the ci

There are still a number of special formatting marks, as well as punctuation and other special characters, in this text. As a simple solution, in the code block below write a call to the re.sub function that replaces anything that is not a lower-case letter with a space. (Hint: you did this exact thing in Tutorial 5.) Once again, print out the result at the end of the code block.

In [11]:
website = re.sub('[^a-z]', ' ', website)
print(website[:1000])
                  university of richmond        from wikipedia  the free encyclopedia             jump to navigation     jump to search     university of richmond motto verbum vitae et lumen scientiae  latin motto in      english word of life and the light of knowledge            type privateestablished                 endowment       billion                  president ronald crutcheracademic staff          full time      part time            students                 undergraduates              full time      part time            postgraduates          full time      part time            location richmond  virginia  u s campus suburban      acres           km  colors ur blue and ur red                                                                 athletics ncaa division i   a   nickname spidersaffiliations acssuracicmascot webstur the spider           website www richmond edu  the university of richmond  ur or u of r  is a private  nonsectarian  liberal arts college located in the ci

As a final step in cleaning the output, notice that the cleaning process has left long runs of spaces in many places. Use a regular expression to convert any sequence of spaces into a single space. And again, print out the result.

In [12]:
website = re.sub('[ ]+', ' ', website)
print(website[:1000])
 university of richmond from wikipedia the free encyclopedia jump to navigation jump to search university of richmond motto verbum vitae et lumen scientiae latin motto in english word of life and the light of knowledge type privateestablished endowment billion president ronald crutcheracademic staff full time part time students undergraduates full time part time postgraduates full time part time location richmond virginia u s campus suburban acres km colors ur blue and ur red athletics ncaa division i a nickname spidersaffiliations acssuracicmascot webstur the spider website www richmond edu the university of richmond ur or u of r is a private nonsectarian liberal arts college located in the city of richmond virginia with small portions of the campus extending into surrounding henrico county university of richmond is a primarily undergraduate residential university with approximately undergraduate and graduate students in five schools the school of arts and sciences the e claiborne rob

Now you should have a nice, clean version of just the words in the Wikipedia page. Yay! See how awesome regular expressions can be!

Extracting Words

Now that we have the raw text as a single string, we will use the function re.split to split apart the individual words. Do this in the block below, saving the result as a variable called words.

In [13]:
words = re.split(' ', website)

As we move forward in the course, we will see a number of things that can be done with these words such as building predictive and generative models. For today, let's just find the most frequently used words on the page. To do this with a minimal amount of code, we will load a function called Counter from the module collections:

In [14]:
from collections import Counter
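Before applying it to the full page, it may help to see Counter on a small hand-made list (a quick illustration; the toy list here is made up):

```python
from collections import Counter

toy = ['spider', 'web', 'spider', 'campus', 'spider', 'web']
counts = Counter(toy)

# most_common(n) returns the n most frequent items as (word, count)
# pairs, sorted from most to least frequent
print(counts.most_common(2))   # [('spider', 3), ('web', 2)]
```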

If you saved the result above as a variable called words, as instructed, the following code will spit out the 30 most common words in the text along with their counts. You can of course change the number 30 to anything you would like, but 30 seems to work well for this exercise.

In [15]:
Counter(words).most_common(30)
Out[15]:
[('the', 310),
 ('of', 239),
 ('and', 114),
 ('university', 112),
 ('richmond', 112),
 ('in', 109),
 ('to', 73),
 ('a', 72),
 ('college', 63),
 ('school', 57),
 ('s', 52),
 ('on', 46),
 ('for', 41),
 ('from', 39),
 ('campus', 39),
 ('virginia', 36),
 ('retrieved', 35),
 ('students', 30),
 ('is', 28),
 ('arts', 27),
 ('student', 27),
 ('as', 26),
 ('original', 26),
 ('was', 23),
 ('edit', 22),
 ('colleges', 22),
 ('with', 21),
 ('amp', 21),
 ('archived', 21),
 ('robins', 19)]

Are these the words you would have expected to be the most common on the University of Richmond Wikipedia page? Why or why not?

Answer: This is a mix of function words ('the', 'of', and 'and') as well as topical words that I would expect to be associated with UR ('richmond', 'university', 'student').

Wrapping it all up

In this tutorial I broke down all of the steps in requesting, cleaning, and counting the most frequent words from a page on Wikipedia. The entire process, combined, requires only about 10 lines of code. In the code block below I want you to put all of the steps together, with the page URL at the top (here I put a new URL, the one for the page about Marxism). At the end of the block, the 30 most common words on the page should show up.

In [16]:
url = 'https://en.wikipedia.org/wiki/Marxism'

r = requests.get(url)
website = r.text
website = website[website.find("<body"):website.find("<noscript>")]
website = re.sub('<[^>]+>', '', website)
website = website.lower()
website = re.sub('[^a-z]', ' ', website)
website = re.sub('[ ]+', ' ', website)
words = re.split(' ', website)
Counter(words).most_common(30)
Out[16]:
[('the', 648),
 ('of', 547),
 ('and', 317),
 ('in', 189),
 ('to', 173),
 ('a', 155),
 ('marx', 118),
 ('marxism', 101),
 ('that', 95),
 ('s', 93),
 ('is', 93),
 ('marxist', 92),
 ('as', 73),
 ('class', 69),
 ('social', 65),
 ('by', 62),
 ('production', 61),
 ('theory', 52),
 ('society', 47),
 ('economic', 44),
 ('on', 44),
 ('socialism', 44),
 ('from', 43),
 ('with', 41),
 ('political', 40),
 ('history', 40),
 ('karl', 40),
 ('engels', 38),
 ('for', 37),
 ('socialist', 34)]

Once you have the code tested and working, try inputting several other Wikipedia pages and begin exploring what you see in the data. (Note: This should now be easy, as you only have to run a single block of code.) Are there any interesting patterns or missing words that start to show up?

Answer: Most of the common terms are function words, such as conjunctions, articles, and prepositions, rather than content words. The few content words are all reasonable: marx, marxism, marxist, class, social, etc. The content words show up more towards the end of the list.
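To make that kind of exploration even easier, the whole pipeline can be wrapped in a pair of short functions (a sketch; the names count_words and top_words are my own, and the cleaning steps are exactly the ones from this tutorial):

```python
import re
import requests
from collections import Counter

def count_words(html, n=30):
    """Clean a Wikipedia HTML page and return its n most common words."""
    # keep only the body, dropping the HTML header and trailing script chunk
    text = html[html.find("<body"):html.find("<noscript>")]
    text = re.sub('<[^>]+>', '', text)   # strip HTML tags
    text = text.lower()
    text = re.sub('[^a-z]', ' ', text)   # keep only lower-case letters
    text = re.sub('[ ]+', ' ', text)     # collapse runs of spaces
    return Counter(re.split(' ', text)).most_common(n)

def top_words(url, n=30):
    """Download a page with requests and count its words."""
    return count_words(requests.get(url).text, n)

# e.g. top_words('https://en.wikipedia.org/wiki/Marxism')  (needs a network connection)
```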