Tutorial 10: Looping through Wikipedia

In this tutorial, we combine our lists and loops with the MediaWiki API functions to grab data from several websites in an automated way.

Modules

We will again need the functions I gave last class for loading data from Wikipedia, both today and for the foreseeable future. Rather than having to copy and paste them each time, there is an easy way to load these functions from a common file.

I've created the file wiki.py that you should download from the course website and put into the same directory where you store your tutorials. You can open and edit the file in Jupyter, which I suggest you do right now to get a sense of what the file looks like. It is basically one long code cell. To load the functions in this file, we write import along with the name of the file (without the extension).

In [1]:
import wiki

Now, to get one of the functions in the module, we use the normal "module name" + "." + "function name" calling convention. So, to get the function wiki_json_path we would do this:

In [2]:
wiki.wiki_json_path("University of Richmond")
Out[2]:
'/Users/taylor/gh/stat289-fall-2018-statsmaths/data/en/University_of_Richmond.json'

Remember that you can see the help page for a function like this:

In [3]:
help(wiki.wiki_json_path)
Help on function wiki_json_path in module wiki:

wiki_json_path(page_title, lang='en')
    Returns local path to JSON file for Wikipedia page data.
    
    This function is used to determine where the dump of a
    call to the MediaWiki API, using the parse method, should
    be stored. As an extra action, the function also checks that
    the relevant directory exists and creates it if it does not.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
    
    Returns:
        A string describing a relative path to file.

I've made a few small changes to the code in wiki.py to make it function a bit better for us and to deal with some annoying edge cases. I may need to fix some other edge cases as we work through the data (pages like "AC/DC" and "Guns & Roses" failed on the original code).
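
To see why a title like "AC/DC" causes trouble, remember that wiki_json_path builds a local file name out of the page title, and a "/" inside a file name gets read as a directory separator. Below is a minimal sketch of the kind of clean-up this requires; the sanitize_title helper is hypothetical and just for illustration, not the actual code in wiki.py.

import re

def sanitize_title(page_title):
    # hypothetical example: swap out characters that are unsafe in file names
    return re.sub(r'[/\\]', '_', page_title)

sanitize_title("AC/DC")     # returns 'AC_DC'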

Dictionaries

We saw last time that internal links, links to other pages on Wikipedia, are returned as a particular element of the JSON data returned by the MediaWiki API. Here, we will show how to extract data from the JSON object.

Let's start by loading the data from a single Wikipedia page. As I mentioned briefly last time, the Python object that stores JSON data is called a "dict" (short for dictionary).

In [4]:
data = wiki.get_wiki_json("University of Richmond")
type(data)
Out[4]:
dict

A dictionary is similar to a list in that it stores a collection of items. While a list keeps all of its items in a particular order, a dictionary associates each element with a named "key". We saw these keys in the JSON file from last time. To see all of the keys in a particular dictionary, use the keys method:

In [5]:
data.keys()
Out[5]:
dict_keys(['title', 'pageid', 'revid', 'text', 'langlinks', 'categories', 'links', 'templates', 'images', 'externallinks', 'sections', 'parsewarnings', 'displaytitle', 'iwlinks', 'properties'])

To grab an element from the dictionary, we use square brackets with the name (in quotes) of the desired key. Again, similar to a list but with a twist. Here I'll print out the title of the page.

In [6]:
data['title']
Out[6]:
'University of Richmond'
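
All of this works the same way for any dictionary, not just the Wikipedia data. As a quick aside (the small dictionary below is made up purely for illustration), you can build a dictionary directly with curly braces by pairing each key with a value:

course = {'name': 'STAT 289', 'seats': 30, 'topics': ['python', 'wikipedia']}
course.keys()       # dict_keys(['name', 'seats', 'topics'])
course['seats']     # 30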

The title is a single string, but it's possible for a dictionary element to be a list or even another dictionary.

In [7]:
type(data['langlinks'])
Out[7]:
list
In [8]:
data['langlinks']
Out[8]:
[{'lang': 'az',
  'url': 'https://az.wikipedia.org/wiki/Ri%C3%A7mond_Universiteti',
  'langname': 'Azerbaijani',
  'autonym': 'azərbaycanca',
  '*': 'Riçmond Universiteti'},
 {'lang': 'de',
  'url': 'https://de.wikipedia.org/wiki/University_of_Richmond',
  'langname': 'German',
  'autonym': 'Deutsch',
  '*': 'University of Richmond'},
 {'lang': 'es',
  'url': 'https://es.wikipedia.org/wiki/Universidad_de_Richmond',
  'langname': 'Spanish',
  'autonym': 'español',
  '*': 'Universidad de Richmond'},
 {'lang': 'it',
  'url': 'https://it.wikipedia.org/wiki/Universit%C3%A0_di_Richmond',
  'langname': 'Italian',
  'autonym': 'italiano',
  '*': 'Università di Richmond'},
 {'lang': 'fi',
  'url': 'https://fi.wikipedia.org/wiki/Richmondin_yliopisto',
  'langname': 'Finnish',
  'autonym': 'suomi',
  '*': 'Richmondin yliopisto'},
 {'lang': 'sv',
  'url': 'https://sv.wikipedia.org/wiki/University_of_Richmond',
  'langname': 'Swedish',
  'autonym': 'svenska',
  '*': 'University of Richmond'},
 {'lang': 'ur',
  'url': 'https://ur.wikipedia.org/wiki/%DB%8C%D9%88%D9%86%DB%8C%D9%88%D8%B1%D8%B3%D9%B9%DB%8C_%D8%A2%D9%81_%D8%B1%DA%86%D9%85%D9%86%DA%88',
  'langname': 'Urdu',
  'autonym': 'اردو',
  '*': 'یونیورسٹی آف رچمنڈ'},
 {'lang': 'zh',
  'url': 'https://zh.wikipedia.org/wiki/%E9%87%8C%E5%A3%AB%E6%BB%A1%E5%A4%A7%E5%AD%A6',
  'langname': 'Chinese',
  'autonym': '中文',
  '*': '里士满大学'}]

What if we want information about the Azerbaijani page for the University of Richmond? Well, this is just a list, so we grab the first element with [0] as usual:

In [9]:
data['langlinks'][0]
Out[9]:
{'lang': 'az',
 'url': 'https://az.wikipedia.org/wiki/Ri%C3%A7mond_Universiteti',
 'langname': 'Azerbaijani',
 'autonym': 'azərbaycanca',
 '*': 'Riçmond Universiteti'}

And what data type is this element? It's another dictionary:

In [10]:
type(data['langlinks'][0])
Out[10]:
dict

And so we could grab an element, such as the language name, like this:

In [11]:
data['langlinks'][0]['langname']
Out[11]:
'Azerbaijani'

And if we want all of the language links? We need to combine our looping knowledge with the dictionary methods:

In [12]:
lang_names = []

for lang in data['langlinks']:
    lang_names = lang_names + [lang['langname']]
    
print(lang_names)
['Azerbaijani', 'German', 'Spanish', 'Italian', 'Finnish', 'Swedish', 'Urdu', 'Chinese']

Now, let's do something similar to get the internal links from our Wikipedia page. These are stored in the element named 'links' from the object data. Print out (the first few rows of) this object below:

In [13]:
data['links'][:10]
Out[13]:
[{'ns': 14,
  'exists': '',
  '*': 'Category:Articles with dead external links from August 2018'},
 {'ns': 14, 'exists': '', '*': 'Category:University of Richmond'},
 {'ns': 10, 'exists': '', '*': 'Template:University of Richmond'},
 {'ns': 10,
  'exists': '',
  '*': 'Template:Colleges and universities in Virginia'},
 {'ns': 10, 'exists': '', '*': 'Template:Associated Colleges of the South'},
 {'ns': 10,
  'exists': '',
  '*': 'Template:Southeastern Universities Research Association'},
 {'ns': 10, 'exists': '', '*': 'Template:Atlantic 10 Conference navbox'},
 {'ns': 0, 'exists': '', '*': '2008 Montana Grizzlies football team'},
 {'ns': 0, 'exists': '', '*': '2008 Richmond Spiders football team'},
 {'ns': 0,
  'exists': '',
  '*': "2010-11 Richmond Spiders men's basketball team"}]

Now, what kind of object are the links stored in? Use the type function below to figure this out:

In [14]:
type(data['links'])
Out[14]:
list

You should see that the links are stored as a list. Each element of the list is a particular link. Below, grab just the first (remember, this is element '0') link in the list:

In [15]:
data['links'][0]
Out[15]:
{'ns': 14,
 'exists': '',
 '*': 'Category:Articles with dead external links from August 2018'}

Use the type function again to detect the object type of a particular link.

In [16]:
type(data['links'][0])
Out[16]:
dict

You should see that this is a dictionary. Now (yes, there's more!) print out the names of the keys for this dictionary:

In [17]:
data['links'][0].keys()
Out[17]:
dict_keys(['ns', 'exists', '*'])

You should see that there are three elements in the dictionary. Here are what the three elements mean:

  • ns: an integer giving the "namespace" of the link. Each type of page has its own namespace. The links to regular article pages all have a namespace code of 0.
  • exists: this is an empty string. It is used because the key 'exists' appears only if the link is not dead (in other words, it links to a page that actually exists).
  • *: this is the actual internal link.

Print out the namespace of the first link:

In [18]:
data['links'][0]['ns']
Out[18]:
14

You should see that the namespace is 14 because the first link is to a Category page (Categories are always 14).
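
Since every link carries a namespace code, one quick way to see what kinds of links a page has is to tally the codes. The sketch below uses collections.Counter (which we will meet again at the end of this tutorial); the name ns_counts is just my own choice:

from collections import Counter

# tally how many links fall into each namespace code
ns_counts = Counter(link['ns'] for link in data['links'])
ns_counts      # a Counter mapping each namespace code to its number of links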

Now, do something similar to what I did in the prior section to create a list named internal_links that grabs all of the links (the elements under *). Print out (the first few elements of) the list at the bottom of the cell.

In [19]:
internal_links = []
for link in data['links']:
    internal_links = internal_links + [link['*']]
    
print(internal_links[:10])
['Category:Articles with dead external links from August 2018', 'Category:University of Richmond', 'Template:University of Richmond', 'Template:Colleges and universities in Virginia', 'Template:Associated Colleges of the South', 'Template:Southeastern Universities Research Association', 'Template:Atlantic 10 Conference navbox', '2008 Montana Grizzlies football team', '2008 Richmond Spiders football team', "2010-11 Richmond Spiders men's basketball team"]

I wrote a small helper function links_as_list (defined in wiki.py) to extract the list of links from a webpage. It should work very similarly to the code you wrote above (open the code file and check it!), but it additionally only includes links where (1) the namespace is equal to 0 (regular article pages) and (2) the page actually exists.
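
A minimal sketch of that kind of filtering is shown below; the name links_as_list_sketch and the details (including the sorting at the end) are my own illustration, so check wiki.py for what the real version does:

def links_as_list_sketch(data):
    # keep links in the article namespace (ns == 0) whose target page exists,
    # and return just the page titles in alphabetical order
    output = []
    for link in data['links']:
        if link['ns'] == 0 and 'exists' in link:
            output.append(link['*'])
    return sorted(output)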

Let's use this to get all of the links of the University of Richmond page.

In [20]:
data = wiki.get_wiki_json("University of Richmond")
links = wiki.links_as_list(data)
links[:20]
Out[20]:
['2008 Montana Grizzlies football team',
 '2008 Richmond Spiders football team',
 "2010-11 Richmond Spiders men's basketball team",
 "2010–11 Kansas Jayhawks men's basketball team",
 "2011 Atlantic 10 Men's Basketball Tournament",
 "2011 NCAA Men's Division I Basketball Tournament",
 'A cappella',
 'Afroman',
 'Alcoa',
 'Alpha Kappa Alpha',
 'Alpha Phi Alpha',
 'Alpha Phi Omega',
 'Altria Group',
 'Aluminum',
 'Alumnus',
 'American Civil War',
 'American Jobs Act',
 'Appalachian College of Pharmacy',
 'Appalachian School of Law',
 'Associated Colleges of the South']

Now, a reasonable next step would be to grab the data associated with each of these pages. To download the data for the first link we would just do this:

In [21]:
data = wiki.get_wiki_json(links[0])
data.keys()
Out[21]:
dict_keys(['title', 'pageid', 'revid', 'text', 'langlinks', 'categories', 'links', 'templates', 'images', 'externallinks', 'sections', 'parsewarnings', 'displaytitle', 'iwlinks', 'properties'])

How do we do this automatically for all of the links? We want to make use of a for loop. A for loop cycles through all of the elements of a list and applies a set of instructions to each element.

Here's an example where we take each element in the list of links and print out just the first three letters:

In [22]:
for link in links[:20]:
    print(link[:3])
200
200
201
201
201
201
A c
Afr
Alc
Alp
Alp
Alp
Alt
Alu
Alu
Ame
Ame
App
App
Ass

If we want to grab the webpage data for each link from the UR page, we can now just do this (this will take a while the first time you run it, but will be quick the second time):

In [23]:
for link in links:
    wiki.get_wiki_json(link)

Using the MediaWiki data

Now, finally, we have the code and functionality to look at a collection of Wikipedia pages. Let's start with the simple task of counting how many links each of the pages linked from the Richmond page has. Pay attention to how I do this!

In [24]:
num_links = []
data_json = wiki.get_wiki_json("University of Richmond")
ur_links = wiki.links_as_list(data_json)

for link in ur_links:
    data = wiki.get_wiki_json(link)
    new_links = wiki.links_as_list(data)
    num_links.append(len(new_links))

Now, let's look at the results:

In [25]:
print(num_links)
[218, 183, 1, 208, 108, 1, 273, 52, 118, 624, 690, 198, 1, 2, 22, 1239, 267, 111, 176, 33, 54, 250, 98, 506, 179, 92, 777, 123, 625, 1618, 447, 99, 286, 94, 1, 181, 1, 81, 274, 486, 349, 475, 146, 44, 212, 322, 439, 369, 787, 611, 1, 169, 1, 930, 98, 946, 18, 32, 427, 218, 526, 308, 297, 381, 364, 401, 801, 604, 30, 48, 14, 240, 15, 348, 212, 155, 318, 1, 96, 420, 251, 134, 261, 379, 237, 612, 791, 462, 25, 8, 283, 338, 373, 286, 2034, 276, 500, 197, 829, 181, 793, 702, 293, 62, 418, 137, 1, 32, 16, 89, 547, 645, 202, 432, 429, 171, 453, 1, 228, 528, 465, 91, 176, 61, 126, 208, 282, 442, 318, 272, 169, 291, 683, 1, 1, 405, 207, 535, 358, 679, 428, 1, 121, 213, 91, 284, 276, 1122, 1, 400, 349, 159, 540, 494, 364, 2, 249, 232, 175, 1954, 75, 438, 462, 470, 171, 397, 448, 156, 152, 258, 109, 267, 248, 3, 36, 857, 274, 392, 330, 220, 506, 592, 345, 101, 102, 544, 699, 206, 1130, 250, 88, 184, 198, 16, 48, 507, 105, 419, 22, 479, 171, 227, 185, 369, 41, 359, 280, 227, 372, 263, 100, 71, 267, 532, 554, 571, 244, 168, 92, 1, 3, 114, 516, 728, 104, 75, 29, 97, 1, 265, 506, 124, 491, 1, 358, 227, 574, 297, 546, 351, 302, 589, 569, 348, 751, 675, 237, 507, 266, 179, 88, 318, 433, 741, 458, 455, 563, 366, 296, 644, 119, 591, 266, 131, 28, 559, 328, 264, 427, 826, 956, 271, 99, 1, 466, 233, 898, 846, 1461, 388, 104, 188, 91, 769, 354, 614, 306, 617, 375, 232, 367, 115, 61, 1, 733, 464, 224, 450, 1453]

What can we do with this? For starters, what's the average number of links on each page?

In [26]:
sum(num_links) / len(num_links)
Out[26]:
331.428093645485

How does this compare to the number of links from the Richmond site?

In [27]:
len(ur_links)
Out[27]:
299

Answer:

Practice

Take a look at the Wikipedia page on Birthday cake:

https://en.wikipedia.org/wiki/Birthday_cake

Below, write code that:

  1. Download the data for every page linked from the "Birthday_cake" Wikipedia page.
  2. Then, extract all of the links from each of those pages and put them together in one combined list called all_links.
  3. Use the collections.Counter object to find the 40 links that are used most across all of the pages.
  4. Think about the most frequent 40 pages and try to reason why these are the most common.
In [28]:
# Make sure all of the links are downloaded
data_json = wiki.get_wiki_json("Birthday_cake")
rr_links = wiki.links_as_list(data_json)

for link in rr_links:
    data = wiki.get_wiki_json(link)
In [29]:
# Now, collect all of the links as a single list
all_links = []

for link in rr_links:
    data = wiki.get_wiki_json(link)
    new_links = wiki.links_as_list(data)
    all_links = all_links + new_links
In [30]:
# Now, count most frequent links
from collections import Counter

Counter(all_links).most_common(40)
Out[30]:
[('Cake', 169),
 ('List of cakes', 168),
 ('Birthday cake', 166),
 ('Biscuit', 166),
 ('Cremeschnitte', 166),
 ('Kürtőskalács', 166),
 ('Mantecadas', 166),
 ('Mille-feuille', 166),
 ('Tompouce', 166),
 ('Pancake', 165),
 ('Bizcocho', 165),
 ('Chorley cake', 165),
 ('Croquembouche', 165),
 ("Flies' graveyard", 165),
 ('Ladyfinger (biscuit)', 165),
 ('Madeleine (cake)', 165),
 ('Marry girl cake', 165),
 ('Meringue', 165),
 ('Mooncake', 165),
 ('Pandoro', 165),
 ('Paper wrapped cake', 165),
 ('Punschkrapfen', 165),
 ('Sweetheart cake', 165),
 ('Torta caprese', 165),
 ('Baumkuchen', 164),
 ('Layer cake', 164),
 ('Prinzregententorte', 164),
 ('Spettekaka', 164),
 ('Spit cake', 164),
 ('Trdelník', 164),
 ('Wedding cake', 164),
 ('Šakotis', 164),
 ('Banana bread', 163),
 ('Batik cake', 163),
 ('Bolo Rei', 163),
 ('Cheesecake', 163),
 ('Cupcake', 163),
 ('Dobos torte', 163),
 ('French Fancy', 163),
 ("Groom's cake", 163)]

For next time

Next class we are going to do some interactive graphics. Make sure that you can import the bokeh library.

In [31]:
from bokeh.plotting import figure, show, output_notebook, ColumnDataSource

If that produces an error, run the following either in your terminal or in the Anaconda prompt.

conda install bokeh

Also, next Tuesday we are going to start doing some network analysis. This means that we will need to use the networkx module, which is not included in the standard Anaconda Python installation. Please make sure that you have it installed correctly by running the following:

In [32]:
import networkx as nx

If there is a problem, please let me know before the end of class today.