Tutorial 30: MediaWiki Page History¶

Here, we return to our study of the MediaWiki API. This will be a useful review of how to grab and parse data, both of which are key steps in data science work. Specifically, we will see how to grab and parse the history of a Wikipedia page.

In [1]:
import json
import os
from os.path import join
import re
import requests

import wiki
import iplot
import matplotlib.pyplot as plt


I am going to demonstrate using the Wikipedia page for coffee; you'll be able to look at other pages later in the tutorial. To start, let's grab the page JSON (either from disk or downloaded through the functions provided in wiki.py). I'll also print out the keys in the returned dictionary object.

In [2]:
page_json = wiki.get_wiki_json("Coffee")
page_json.keys()

Out[2]:
dict_keys(['title', 'pageid', 'revid', 'redirects', 'text', 'langlinks', 'categories', 'links', 'templates', 'images', 'externallinks', 'sections', 'parsewarnings', 'displaytitle', 'iwlinks', 'properties'])

We have not done anything with the 'revid' yet, but it will be key for our work today. The 'revid' is a unique key that identifies a particular version of a page. It is unique on its own; you do not need to combine it with the 'pageid' (I found this confusing in the MediaWiki documentation).
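For instance, a revid alone is enough to build a permalink to that exact version of a page, using the oldid parameter of index.php (the same parameter we pass to the parse API later in this tutorial):

```python
# A revid by itself identifies one version of one page; Wikipedia
# permalinks are built from it with the oldid parameter of index.php.
revid = 865835930  # the current revision of the Coffee page from above
permalink = "https://en.wikipedia.org/w/index.php?oldid={0:d}".format(revid)
print(permalink)
```

Opening that URL in a browser shows the page exactly as it stood at that revision.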

Query API¶

In order to get previous revision ids for a given page, we need to use the 'query' API. Last time, and throughout the semester so far, we have only been using the 'parse' functionality of the MediaWiki API. Let's see how the query API works. Recall that we need a base API URL that the query is sent to; this remains unchanged from Tutorial 8:

In [3]:
lang = 'en'
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php?'
base_api_url

Out[3]:
'https://en.wikipedia.org/w/api.php?'

Now, we produce a query by providing variable keys separated by the ampersand (&) symbol. For the query API we need to specify the following:

• action=query to describe the action that we want to take
• format=json to let the API know we want data returned as JSON objects
• prop=revisions to tell the API that we want to see the page revisions
• rvprop=ids|flags|timestamp|comment|user to indicate what data to return about each revision
• rvstartid=###### to indicate the revision we want to start at, with the proper revid filled in for the ######
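As an aside (not required for the tutorial), the same query string can be assembled with urllib's urlencode instead of manual concatenation. Note that urlencode percent-encodes the pipe characters in rvprop as %7C, which MediaWiki accepts:

```python
from urllib.parse import urlencode

# the same parameters as the bullet list above, as a dictionary
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "rvprop": "ids|flags|timestamp|comment|user",
    "pageids": 604727,       # Coffee
    "rvstartid": 865835930,  # the current revid from above
}
query_url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(query_url)
```

This avoids forgetting an ampersand or a `+` somewhere in a long chain of string concatenations.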

Start by extracting the revid for the Coffee page:

In [4]:
pageid = page_json['pageid']
pageid

Out[4]:
604727
In [5]:
revid = page_json['revid']
revid

Out[5]:
865835930

And then construct the API query:

In [6]:
api_query = base_api_url + "action=query&" + "format=json&" + \
    "prop=revisions&" + "rvprop=ids|flags|timestamp|comment|user&" + \
    "pageids={0:d}&".format(pageid) + \
    "rvstartid={0:d}&".format(revid)
api_query

Out[6]:
'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=ids|flags|timestamp|comment|user&pageids=604727&rvstartid=865835930&'

Finally, we can call the API and have the page data returned to us. Take a few minutes to look at the output before moving on to the next section.

In [7]:
r = requests.get(api_query)
page_data = r.json()
page_data

Out[7]:
{'continue': {'rvcontinue': '20181003154447|862314871', 'continue': '||'},
'query': {'pages': {'604727': {'pageid': 604727,
'ns': 0,
'title': 'Coffee',
'revisions': [{'revid': 865835930,
'parentid': 865835769,
'timestamp': '2018-10-26T13:55:56Z',
'comment': '/* Coffeehouses */'},
{'revid': 865835769,
'parentid': 865834651,
'timestamp': '2018-10-26T13:54:31Z',
'comment': '/* Coffeehouses */'},
{'revid': 865834651,
'parentid': 864501853,
'minor': '',
'timestamp': '2018-10-26T13:44:33Z',
'comment': 'no need for those pictures in the lead'},
{'revid': 864501853,
'parentid': 864501313,
'user': 'Heroeswithmetaphors',
'timestamp': '2018-10-17T17:07:50Z'},
{'revid': 864501313,
'parentid': 864501192,
'user': 'Heroeswithmetaphors',
'timestamp': '2018-10-17T17:03:25Z'},
{'revid': 864501192,
'parentid': 863262382,
'user': 'Heroeswithmetaphors',
'timestamp': '2018-10-17T17:02:20Z'},
{'revid': 863262382,
'parentid': 863213758,
'user': 'Funfactsandwheretofindthem',
'timestamp': '2018-10-09T18:06:40Z',
'comment': "/* Coffeehouses */ Updated Starbucks' outlet numbers from 2010 date"},
{'revid': 863213758,
'parentid': 863088071,
'minor': '',
'timestamp': '2018-10-09T12:08:40Z',
'comment': ''},
{'revid': 863088071,
'parentid': 863082281,
'user': 'Zefr',
'timestamp': '2018-10-08T17:00:11Z',
'comment': '/* Method of action */ add more focused review using [[WP:UCB|Citation bot]]'},
{'revid': 863082281,
'parentid': 862314871,
'user': 'Supersarah1121',
'timestamp': '2018-10-08T16:13:31Z',
'comment': 'Medical Citation Added'}]}}}}

What information is shown about each revision? What does the parameter 'parentid' tell you about a revision? How many revisions are returned here?
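One property worth checking for yourself: since revisions are returned newest first, each revision's parentid should equal the revid of the entry that follows it. A quick sanity check using the first three revisions from the output above:

```python
# first three revisions from the output above (newest first)
revisions = [
    {"revid": 865835930, "parentid": 865835769},
    {"revid": 865835769, "parentid": 865834651},
    {"revid": 865834651, "parentid": 864501853},
]

# each revision's parent is the next (older) revision in the list
for newer, older in zip(revisions, revisions[1:]):
    assert newer["parentid"] == older["revid"]
print("parentid chain is consistent")
```

In other words, the revisions form a linked list running backwards through the page's history.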

Parsing page dates¶

I want you to try writing some code again. Cycle through the output of page_data to produce a Python list named rev_years that gives the year of each revision. Note: you might start by storing the entire timestamp and then modify the code to store just the year. Second note: you don't need to be too clever to get the year from the timestamp; just take the first four characters of the string.

In [8]:
revisions = page_data['query']['pages']['604727']['revisions']
len(revisions)

Out[8]:
10
In [9]:
rev_years = []

for rev in revisions:
    rev_years.append(rev['timestamp'][:4])

rev_years

Out[9]:
['2018',
'2018',
'2018',
'2018',
'2018',
'2018',
'2018',
'2018',
'2018',
'2018']

Once you have created the list rev_years, run the following to show how many revisions there are in each year.

In [10]:
import collections

collections.Counter(rev_years)

Out[10]:
Counter({'2018': 10})

The output likely won't be too interesting yet because you only have the first 10 revisions of the page. Let's see how to rectify that now.

Increasing the page limit¶

By default, MediaWiki returns only the 10 most recent revisions. We can fetch up to 500 revisions at a time by adding the parameter rvlimit=max to the query. Modify the variable api_query and fetch the modified data. I suggest NOT printing out the results of page_data because there will be a lot of output!

In [11]:
api_query = base_api_url + "action=query&" + "format=json&" + \
    "prop=revisions&" + "rvprop=ids|flags|timestamp|comment|user&" + \
    "rvlimit=max&" + \
    "pageids={0:d}&".format(pageid) + \
    "rvstartid={0:d}&".format(revid)
r = requests.get(api_query)
page_data = r.json()

revisions = page_data['query']['pages']['604727']['revisions']


Now, copy the code you had above to find the distribution of years for the first 500 revisions.

In [12]:
rev_years = []

for rev in revisions:
    rev_years.append(rev['timestamp'][:4])

collections.Counter(rev_years)

Out[12]:
Counter({'2018': 112, '2017': 386, '2016': 2})

You should see about 112 revisions for 2018, 386 for 2017, and just 2 for 2016. This is a bit more interesting, but still not the whole story, because even the maximum query returns just 500 revisions.

Continuing the query¶

In order to get more results from the API we must make another query request. This is similar to viewing the "next page" of results on a site such as Google or Amazon when searching for a webpage or product. The idea is common to many APIs, such as those offered by Google, Twitter, and Facebook.
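The pattern is the same regardless of the API: request a batch, keep the items, and repeat while the response hands back a continuation token. Here is a toy sketch of that loop, with a fake three-page "API" standing in for MediaWiki:

```python
def fetch_all(fetch_page):
    """Collect items from a paginated source.

    fetch_page(token) returns (items, next_token); next_token is None
    when there are no more pages. Start with token=None.
    """
    items, token = [], None
    while True:
        batch, token = fetch_page(token)
        items += batch
        if token is None:
            return items

# fake three-page API: maps a token to (items, next_token)
pages = {None: ([1, 2], "a"), "a": ([3, 4], "b"), "b": ([5], None)}
print(fetch_all(lambda token: pages[token]))  # [1, 2, 3, 4, 5]
```

MediaWiki's version of the token is the 'rvcontinue' value we are about to use.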

Notice that the variable page_data contains an element named continue:

In [13]:
page_data['continue']

Out[13]:
{'rvcontinue': '20161214161444|754802248', 'continue': '||'}

Create a variable api_query_continue that appends the parameter rvcontinue=###### to api_query, with the ###### filled in from the value above.

In [14]:
api_query_continue = api_query + "rvcontinue=" + page_data['continue']['rvcontinue']


Next, call this query and load the data into the variable page_data.

In [15]:
r = requests.get(api_query_continue)
page_data = r.json()


And now see the distribution of page years in this new chunk of data:

In [16]:
rev_years = []

for rev in page_data['query']['pages'][str(pageid)]['revisions']:
    rev_years.append(rev['timestamp'][:4])

collections.Counter(rev_years)

Out[16]:
Counter({'2016': 322, '2015': 178})

You should see about 322 revisions in 2016 and another 178 in 2015.

Getting all of the pages¶

We still do not have all of the Coffee revisions, just the next 500 of them. In order to get them all, we have to cycle through these continue statements until we reach the end of the revision history. Let's write the code to handle this now. Rather than just grabbing the years, we will construct a list of all the information about each revision.

I've written the function for you to generate the list of revisions, but you should be able to understand what is going on in all of the code. It prints the progress of the API calls, showing the number of revisions grabbed so far as well as the last timestamp grabbed.

In [17]:
def wiki_page_revisions(page_title):
    page_json = wiki.get_wiki_json(page_title)
    pageid = page_json['pageid']
    revid = page_json['revid']

    api_query = base_api_url + "action=query&" + "format=json&" + \
        "prop=revisions&" + "rvprop=ids|flags|timestamp|comment|user&" + \
        "rvlimit=max&" + \
        "pageids={0:d}&".format(pageid) + \
        "rvstartid={0:d}&".format(revid)
    r = requests.get(api_query)
    page_data = r.json()

    rev_data = page_data['query']['pages'][str(pageid)]['revisions']

    while 'continue' in page_data:
        api_query_continue = api_query + \
            "rvcontinue={0:s}&".format(page_data['continue']['rvcontinue'])
        r = requests.get(api_query_continue)
        page_data = r.json()
        rev_data += page_data['query']['pages'][str(pageid)]['revisions']
        msg = "Loaded {0:d} revisions, through {1:s}"
        print(msg.format(len(rev_data), rev_data[-1]['timestamp']))

    return rev_data

In [18]:
rev_data = wiki_page_revisions("Coffee")

Loaded 1000 revisions, through 2015-10-06T08:23:46Z


Just looking at the message output, how many revisions have been made to the Coffee page, and when was the page first created?

Modify the code you used above to grab the list rev_years from the list rev_data.

In [19]:
rev_years = []

for rev in rev_data:
    rev_years.append(rev['timestamp'][:4])

collections.Counter(rev_years)

Out[19]:
Counter({'2018': 112,
'2017': 386,
'2016': 324,
'2015': 533,
'2014': 532,
'2013': 682,
'2012': 589,
'2011': 624,
'2010': 1065,
'2009': 1059,
'2008': 1073,
'2007': 1842,
'2006': 1594,
'2005': 692,
'2004': 129})

In what year were the most revisions made? Has 2018 had an unusually high or low number of revisions at this point in the year?

Finally, you can even use the following code to produce a line plot of the number of revisions in each year.

In [20]:
plt.rcParams["figure.figsize"] = (12, 10)

In [21]:
cnt = collections.Counter(rev_years).items()
cnt = sorted(cnt, key=lambda x: x[0])
plt.xticks(rotation=90)
plt.plot([x[0] for x in cnt], [x[1] for x in cnt], 'k-', lw=2)

Out[21]:
[<matplotlib.lines.Line2D at 0x11fbbb3c8>]

Revision data¶

We now have some information about each of the page revisions, but we still have not seen how to grab the actual page data for a given revision. To do this, we need to return to the "parse" API with our revision ids in hand. Essentially, all we need to do is call the parse action, specify the format as JSON, and provide the revid via the parameter oldid. So, to get the very first version of the Coffee page we would first get the revision id:

In [22]:
revid = rev_data[-1]['revid']
revid

Out[22]:
3245467

And then place the following API query:

In [23]:
api_query = base_api_url + "action=parse&" + "format=json&" + \
    "oldid={0:d}&".format(revid)
r = requests.get(api_query)
page_data = r.json()['parse']


This page is now in the format we have been working with for the rest of the semester, but gives an old version of the page, way back from 2004.

In [24]:
page_data.keys()

Out[24]:
dict_keys(['title', 'pageid', 'revid', 'text', 'langlinks', 'categories', 'links', 'templates', 'images', 'externallinks', 'sections', 'parsewarnings', 'displaytitle', 'iwlinks', 'properties'])

It would be fun to see what this page actually looks like rendered as html. Let's write it to disk with some header information to make it look reasonable:

In [25]:
with open('temp.html', 'w', encoding="UTF-8") as fout:
    fout.writelines("<html><body>")
    fout.writelines(page_data['text']['*'])
    fout.writelines("</body></html>")


You should see that the page was very basic back in 2004!

A different page¶

Repeat the code above (you can drop all of the steps into one or two code blocks) to look at another page that interests you.

In [27]:
rev_data = wiki_page_revisions("Data Science")
rev_years = []

for rev in rev_data:
    rev_years.append(rev['timestamp'][:4])

cnt = collections.Counter(rev_years).items()
cnt = sorted(cnt, key=lambda x: x[0])
plt.xticks(rotation=90)
plt.plot([x[0] for x in cnt], [x[1] for x in cnt], 'k-', lw=2)

Pulling data from MediaWiki API: 'Data Science'

Out[27]:
[<matplotlib.lines.Line2D at 0x11e13ab00>]

How does the pattern of changes over time differ from the Coffee page?

What next¶

We certainly could grab the revision history for every change that has been made to the Coffee page. Most changes, though, are not particularly interesting on their own (at least for our level of study here). Instead, I want to focus on large-scale change over time by grabbing one page for each year in the collection. Starting with the values in rev_data, use the space below to grab the last version of the Coffee page for each year. That is, start with the current page, then get the last page from 2017, then 2016, and so forth, until you get back to the page at the end of 2004. Store these pages in a list named page_history.
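The key observation is that rev_data is ordered newest first, so the first revision you encounter for a given year is that year's last revision. Here is a sketch of just that selection logic on made-up data (the revids and timestamps are invented for illustration):

```python
# toy newest-first revision list (invented revids and timestamps)
sample = [
    {"revid": 3, "timestamp": "2018-10-26T13:55:56Z"},
    {"revid": 2, "timestamp": "2018-01-02T12:00:00Z"},
    {"revid": 1, "timestamp": "2017-12-31T23:59:59Z"},
]

seen_years = set()
last_of_year = []
for rev in sample:
    year = rev["timestamp"][:4]
    if year not in seen_years:  # first hit per year = last revision of that year
        seen_years.add(year)
        last_of_year.append(rev["revid"])

print(last_of_year)  # [3, 1]
```

In the real task, each time a new year is hit you would also call the parse API with that revid, as the solution below does.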

In [28]:
# Note: run this cell just once to refresh the value in rev_data
# from running the code to grab a different page above
rev_data = wiki_page_revisions("Coffee")

Loaded 1000 revisions, through 2015-10-06T08:23:46Z

In [49]:
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php?'

page_history = []
last_year = int(rev_data[0]['timestamp'][:4]) + 1

for rev in rev_data:
    this_year = int(rev['timestamp'][:4])
    if this_year < last_year:
        last_year = this_year

        # grab the page
        revid = rev['revid']
        api_query = base_api_url + "action=parse&" + "format=json&" + \
            "oldid={0:d}&".format(revid)
        r = requests.get(api_query)
        page_data = r.json()['parse']

        page_history.append((rev, page_data))

        # output progress
        print("Grabbed page at {0:d}".format(revid))

Grabbed page at 865835930
Grabbed page at 817267881
Grabbed page at 755468653
Grabbed page at 697272803
Grabbed page at 640191769
Grabbed page at 588559107
Grabbed page at 530615673
Grabbed page at 468752501
Grabbed page at 405119400
Grabbed page at 335170993
Grabbed page at 260942162
Grabbed page at 181074859
Grabbed page at 97628123
Grabbed page at 33382948
Grabbed page at 9036398


Then, when you are done with that, cycle through the pages to extract the length of the page in characters, the number of internal links, the number of external links, the number of images, and the number of sections.

In [59]:
import xml.etree.ElementTree as ET
import pandas as pd

meta = dict(timestamp=[], pageid=[], revid=[], link=[], title=[], year=[],
            num_chars=[], num_p=[], num_sections=[], num_images=[],
            num_ilinks=[], num_elinks=[], num_langs=[])

for rev, page_data in page_history:
    tree = ET.fromstring(page_data['text']['*'])
    meta['timestamp'].append(rev['timestamp'])
    meta['pageid'].append(page_data['pageid'])
    meta['revid'].append(rev['revid'])
    meta['link'].append(page_data['title'])
    meta['title'].append(re.sub('<[^>]+>', '', page_data['displaytitle']))
    meta['year'].append(int(rev['timestamp'][:4]))
    meta['num_chars'].append(len(page_data['text']['*']))
    meta['num_p'].append(len(tree.findall(".//p")))
    meta['num_sections'].append(len(page_data['sections']))
    meta['num_images'].append(len(page_data['images']))
    meta['num_ilinks'].append(len(page_data['links']))
    meta['num_elinks'].append(len(page_data['externallinks']))
    meta['num_langs'].append(len(page_data['langlinks']))

pdf = pd.DataFrame(meta).drop_duplicates()
pdf

Out[59]:
timestamp pageid revid link title year num_chars num_p num_sections num_images num_ilinks num_elinks num_langs
0 2018-10-26T13:55:56Z 604727 865835930 Coffee Coffee 2018 448075 130 42 43 681 241 169
1 2017-12-27T07:58:07Z 604727 817267881 Coffee Coffee 2017 437723 128 40 47 660 231 169
2 2016-12-18T05:42:22Z 604727 755468653 Coffee Coffee 2016 459558 133 44 50 652 239 169
3 2015-12-29T11:13:18Z 604727 697272803 Coffee Coffee 2015 433789 126 42 44 644 198 169
4 2014-12-30T04:41:08Z 604727 640191769 Coffee Coffee 2014 421406 126 44 58 595 183 169
5 2013-12-31T18:54:37Z 604727 588559107 Coffee Coffee 2013 374642 129 43 56 556 140 169
6 2012-12-31T14:43:47Z 604727 530615673 Coffee Coffee 2012 360744 119 38 48 537 152 169
7 2011-12-31T10:26:36Z 604727 468752501 Coffee Coffee 2011 358617 114 33 54 536 151 169
8 2010-12-31T05:22:50Z 604727 405119400 Coffee Coffee 2010 328394 89 32 49 498 139 170
9 2009-12-31T20:57:40Z 604727 335170993 Coffee Coffee 2009 244457 65 23 46 391 103 170
10 2008-12-30T20:11:56Z 604727 260942162 Coffee Coffee 2008 194285 50 21 35 320 89 170
11 2007-12-30T23:01:42Z 604727 181074859 Coffee Coffee 2007 160272 39 16 18 299 68 169
12 2006-12-31T23:39:41Z 604727 97628123 Coffee Coffee 2006 95116 31 21 14 234 25 169
13 2005-12-31T15:36:40Z 604727 33382948 Coffee Coffee 2005 114129 79 39 18 298 34 169
14 2004-12-30T20:13:34Z 604727 9036398 Coffee Coffee 2004 33909 33 15 3 86 5 169

Finally, plot the data across time for each of these variables. For example, with the metrics stored as columns of the data frame pdf (such as num_ilinks for the number of internal links), you should be able to run something like this:

In [60]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), pdf['num_chars'], 'k-', lw=2)

Out[60]:
[<matplotlib.lines.Line2D at 0x11da4c048>]
In [61]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), pdf['num_p'], 'k-', lw=2)

Out[61]:
[<matplotlib.lines.Line2D at 0x120e34eb8>]
In [62]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), pdf['num_sections'], 'k-', lw=2)

Out[62]:
[<matplotlib.lines.Line2D at 0x12064ec50>]
In [63]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), pdf['num_images'], 'k-', lw=2)

Out[63]:
[<matplotlib.lines.Line2D at 0x121452e10>]
In [64]:
plt.xticks(rotation=90)

Out[64]:
[<matplotlib.lines.Line2D at 0x1206e5710>]
In [66]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), pdf['num_langs'], 'k-', lw=2)

Out[66]:
[<matplotlib.lines.Line2D at 0x1206e07f0>]
In [ ]:
plt.xticks(rotation=90)


Take note of any interesting patterns that arise over time with the pages.

Even more practice¶

My guess is that the above tasks will take up most of the class time. If you want extra practice or finish early, wrap the code above in a function that takes just a page name and returns all of the metrics as a pandas DataFrame object. Include the timestamp as the first column of the data frame, and make sure that you handle the fact that the number of years may differ between pages and that there may even be no revisions in a given year.

In [68]:
def get_page_history(page_title):

    rev_data = wiki_page_revisions(page_title)
    base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php?'

    page_history = []
    last_year = int(rev_data[0]['timestamp'][:4]) + 1

    for rev in rev_data:
        this_year = int(rev['timestamp'][:4])
        if this_year < last_year:
            last_year = this_year

            # grab the page
            revid = rev['revid']
            api_query = base_api_url + "action=parse&" + "format=json&" + \
                "oldid={0:d}&".format(revid)
            r = requests.get(api_query)
            page_data = r.json()['parse']

            page_history.append((rev, page_data))

            # output progress
            print("Grabbed page at {0:d}".format(revid))

    return page_history

In [69]:
def get_page_history_meta(page_title):
    import xml.etree.ElementTree as ET
    import pandas as pd

    page_history = get_page_history(page_title)

    meta = dict(timestamp=[], pageid=[], revid=[], link=[], title=[], year=[],
                num_chars=[], num_p=[], num_sections=[], num_images=[],
                num_ilinks=[], num_elinks=[], num_langs=[])

    for rev, page_data in page_history:
        tree = ET.fromstring(page_data['text']['*'])
        meta['timestamp'].append(rev['timestamp'])
        meta['pageid'].append(page_data['pageid'])
        meta['revid'].append(rev['revid'])
        meta['link'].append(page_data['title'])
        meta['title'].append(re.sub('<[^>]+>', '', page_data['displaytitle']))
        meta['year'].append(int(rev['timestamp'][:4]))
        meta['num_chars'].append(len(page_data['text']['*']))
        meta['num_p'].append(len(tree.findall(".//p")))
        meta['num_sections'].append(len(page_data['sections']))
        meta['num_images'].append(len(page_data['images']))
        meta['num_ilinks'].append(len(page_data['links']))
        meta['num_elinks'].append(len(page_data['externallinks']))
        meta['num_langs'].append(len(page_data['langlinks']))

    pdf = pd.DataFrame(meta).drop_duplicates()
    return pdf

In [71]:
pdf = get_page_history_meta('Statistics')
pdf

Loaded 1000 revisions, through 2013-04-30T23:42:38Z
Grabbed page at 867661767
Grabbed page at 817612155
Grabbed page at 754986487
Grabbed page at 697433595
Grabbed page at 639300278
Grabbed page at 588463335
Grabbed page at 530307227
Grabbed page at 468573297
Grabbed page at 403631349
Grabbed page at 334749989
Grabbed page at 260637347
Grabbed page at 181279972
Grabbed page at 97370751
Grabbed page at 33018695
Grabbed page at 9191083
Grabbed page at 2220137
Grabbed page at 556462
Grabbed page at 279816

Out[71]:
timestamp pageid revid link title year num_chars num_p num_sections num_images num_ilinks num_elinks num_langs
0 2018-11-07T05:14:26Z 26685 867661767 Statistics Statistics 2018 240263 69 30 23 642 54 131
1 2017-12-29T14:02:01Z 26685 817612155 Statistics Statistics 2017 217251 71 30 16 575 47 131
2 2016-12-15T16:32:00Z 26685 754986487 Statistics Statistics 2016 220733 69 30 25 578 43 131
3 2015-12-30T14:08:50Z 26685 697433595 Statistics Statistics 2015 213930 68 30 24 574 38 131
4 2014-12-23T07:13:33Z 26685 639300278 Statistics Statistics 2014 191070 67 28 24 547 17 131
5 2013-12-31T02:50:14Z 26685 588463335 Statistics Statistics 2013 165371 53 20 18 528 20 131
6 2012-12-29T15:29:36Z 26685 530307227 Statistics Statistics 2012 153761 49 20 15 514 12 131
7 2011-12-30T10:17:04Z 26685 468573297 Statistics Statistics 2011 149561 47 24 14 506 22 131
8 2010-12-22T01:49:03Z 26685 403631349 Statistics Statistics 2010 145910 46 24 15 501 22 131
9 2009-12-29T17:08:16Z 26685 334749989 Statistics Statistics 2009 140638 49 24 14 490 23 131
10 2008-12-29T07:58:39Z 26685 260637347 Statistics Statistics 2008 124992 36 16 15 488 13 131
11 2007-12-31T22:52:04Z 26685 181279972 Statistics Statistics 2007 137766 49 20 9 539 39 131
12 2006-12-30T18:06:41Z 26685 97370751 Statistics Statistics 2006 68782 47 21 3 214 27 131
13 2005-12-28T18:16:20Z 26685 33018695 Statistics Statistics 2005 52753 25 17 6 182 45 131
14 2004-12-31T01:51:03Z 26685 9191083 Statistics Statistics 2004 19079 18 8 0 69 18 131
15 2003-12-31T18:44:55Z 26685 2220137 Statistics Statistics 2003 9822 12 3 0 53 3 131
16 2002-12-28T08:47:09Z 26685 556462 Statistics Statistics 2002 2743 4 2 0 12 1 131
17 2001-11-03T02:49:39Z 26685 279816 Statistics Statistics 2001 3055 12 0 0 10 0 131