Tutorial 08: Saving MediaWiki Requests

Today's notes walk you through building a function to save MediaWiki requests on your machine.

Let's start by loading the four modules that we will need for this tutorial. I will also import the function join by name, as we will use it quite frequently.

In [1]:
import json
import os
import re
import requests

from os.path import join

Making the request

Before we wrap anything up in fancy functions, let's replicate the API request that we made in the browser. Start by defining lang (the language), base_api_url, and default_query:

In [2]:
lang = 'en'
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
default_query = 'action=parse&format=json&'

Now let's define the page title for the University of Richmond (and convert spaces to underscores if needed) and then build the full URL string:

In [3]:
page_title = "University of Richmond"
page_title = re.sub(" ", "_", page_title)
url = base_api_url + "?" + default_query + "page=" + page_title

Pay attention to how I am using the + operator to build a larger string from individual variables. The variable url should now contain the same URL request that we ran in the browser.

In [4]:
print(url)
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=University_of_Richmond
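Gluing strings together with + works for simple titles, but a title containing a character such as & would break the query string. As a hedged alternative, here is a minimal sketch that uses urlencode from Python's standard urllib.parse module. The helper name build_parse_url is my own invention for illustration and is not used elsewhere in this tutorial.

```python
from urllib.parse import urlencode

def build_parse_url(page_title, lang="en"):
    """Build a MediaWiki parse URL (hypothetical helper, for illustration)."""
    base_api_url = "https://" + lang + ".wikipedia.org/w/api.php"
    # Same parameters as default_query above, expressed as a dictionary.
    params = {
        "action": "parse",
        "format": "json",
        "page": page_title.replace(" ", "_"),
    }
    # urlencode percent-encodes special characters and joins the pairs with "&".
    return base_api_url + "?" + urlencode(params)

print(build_parse_url("University of Richmond"))
```

For our page this produces the same URL as the + version; the difference only shows up for titles with special characters.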

Now we are ready to make a request to Wikipedia asking for the page-level data. The format here is exactly the same as requesting an HTML page. You should get a response code of 200, indicating a successful request.

In [5]:
r = requests.get(url)
r
Out[5]:
<Response [200]>

Now, because the data is in JSON format, we need slightly different code to grab it. Here, I'm calling the json method of the response and selecting just the 'parse' element (remember that in your browser the parse element was at the top of the JSON tree).

In [6]:
page_data = r.json()['parse']

If we print this out, you'll see that it contains the same information as you had in your browser (but probably not formatted as well):

In [7]:
# print(page_data)
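One caveat about selecting 'parse' directly: if the request fails (for example, because the page title is misspelled), there is no 'parse' element and Python raises a KeyError. The sketch below shows one way to produce a more readable message. It assumes that the API reports failures under a top-level 'error' element with an 'info' field; the helper name extract_parse and the two sample dictionaries are hand-written for illustration and are not real API output.

```python
def extract_parse(payload):
    """Return the 'parse' element or raise a readable error (illustrative helper)."""
    # Assumption: a failed request is reported under a top-level 'error' element.
    if "error" in payload:
        info = payload["error"].get("info", "unknown error")
        raise ValueError("MediaWiki API error: " + info)
    return payload["parse"]

# Hand-written stand-ins for what r.json() might return.
good = {"parse": {"images": ["UR_Shield.svg"]}}
bad = {"error": {"info": "The page you specified doesn't exist."}}

print(extract_parse(good)["images"])
try:
    extract_parse(bad)
except ValueError as err:
    print(err)
```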

Dictionaries (briefly)

The data we call JSON is represented in Python as an object type known as a dictionary (dict for short):

In [8]:
data = page_data
type(data)
Out[8]:
dict

We'll talk more about dictionaries next class, but I thought it would be useful for you to at least hear the term and see them in action today. You can access the 'elements' of the dictionary by using square brackets and the name of the element that you want to locate. For example, here are all of the images on our page:

In [9]:
data['images']
Out[9]:
['University_of_Richmond_seal.svg',
 'UR_Shield.svg',
 'Richmond-Rummell_View.jpg',
 'BoatwrightTower.jpg',
 'CommonsLake.jpg',
 'Commons-logo.svg',
 'Folder_Hexagonal_Icon.svg']

And here are the first ten external links on the UR Wikipedia page:

In [10]:
data['externallinks'][:10]
Out[10]:
['https://www.richmond.edu/about/motto.html',
 'https://www.nacubo.org/-/media/Nacubo/Documents/EndowmentFiles/2017-Endowment-Market-Values.ashx?la=en&hash=E71088CDC05C76FCA30072DA109F91BBC10B0290',
 'http://ifx.richmond.edu/pdfs/factbook2015-16.pdf',
 'http://ifx.richmond.edu/pdfs/FB2-Enrollment.pdf',
 'https://web.archive.org/web/20150709201808/http://communications.richmond.edu/marks/colors.html',
 'http://communications.richmond.edu/marks/colors.html',
 'http://news.richmond.edu/features/article/-/10822/did-you-know-the-ur-spider-a-bite-of-history.html',
 'https://web.archive.org/web/20120927033702/http://www.dhr.virginia.gov/registers/Cities/Richmond/127-0045_Columbia_1982_Final_Nomination.pdf',
 'http://www.dhr.virginia.gov/registers/Cities/Richmond/127-0045_Columbia_1982_Final_Nomination.pdf',
 'http://www.lva.virginia.gov/public/guides/opac/gilletteabout.htm']
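If you want to experiment with square-bracket access before next class, here is a small sketch on a hand-made dictionary. The dictionary sample is a shortened stand-in that copies a few values from the output above; it is not the full page data.

```python
# A shortened, hand-made stand-in with the same shape as our page data.
sample = {
    "images": ["University_of_Richmond_seal.svg", "UR_Shield.svg"],
    "externallinks": ["https://www.richmond.edu/about/motto.html"],
}

# The names of the elements stored in the dictionary.
print(list(sample.keys()))

# Square brackets select an element; a second pair indexes into the list.
print(sample["images"][0])

# Asking for a missing name with square brackets raises a KeyError;
# the get method returns a default value instead.
print(sample.get("links", []))
```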

Next time we will see how to work with, save, and compute over these various elements.

Saving a dictionary / JSON file

First, let's save our UR JSON data as a file in the tutorials directory on your laptop. The syntax below opens a file named "ur.json" for writing and saves the variable data into it.

In [11]:
with open('ur.json', 'w') as outfile:
    json.dump(data, outfile)

Go back to the file browser in Jupyter and check that the file was created. Click on it and verify that the JSON data is available.

Now, let's load the file back into Python. This time we open the file for reading and save its contents as the object new_data.

In [12]:
with open('ur.json', 'r') as infile:
    new_data = json.load(infile)

When you print out the object new_data it should contain the same information as the original data.

In [13]:
# new_data
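Rather than comparing the two printouts by eye, you can ask Python to check the round trip for you. This is a minimal sketch that uses a small stand-in dictionary and a temporary directory (from the standard tempfile module) so that it does not touch the ur.json file you just created.

```python
import json
import os
import tempfile

# Small stand-in for the page data.
data = {"images": ["UR_Shield.svg"], "externallinks": []}

# Write and read inside a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ur.json")
    with open(path, "w") as outfile:
        json.dump(data, outfile)
    with open(path, "r") as infile:
        new_data = json.load(infile)

# True when the loaded object matches the original.
print(new_data == data)
```

The comparison holds here because the data contains only strings, lists, and dictionaries with string names. JSON has no tuple type, for instance, so a tuple would come back as a list.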

Correct file path

Now, we need a function that returns the correct path where a JSON file should be stored, based on its title and the Wikipedia language. This is a bit tricky, particularly if you want it to run correctly across various operating systems, so I'll just give you the correct code here. You should still be able to understand most of the parts and what is going on.

In [14]:
def wiki_json_path(page_title, lang='en'):
    """Returns local path to JSON file for Wikipedia page data
    
    This function is used to determine where the dump of a 
    call to the MediaWiki API, using the parse method, should
    be stored. As an extra action, the function also checks that
    the relevant directory exists and creates it if it does not.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string giving the path to the file.
    """
    page_title = re.sub(" ", "_", page_title)
    stat289_base_dir = os.path.dirname(os.getcwd())
    
    dir_name = join(stat289_base_dir, "data", lang)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
        
    file_name = page_title + ".json"
    file_path = join(dir_name, file_name)
    
    return file_path

We can test out that the function works as expected here:

In [15]:
wiki_json_path("University of Richmond")
Out[15]:
'/Users/taylor/gh/stat289-fall-2018-statsmaths/data/en/University_of_Richmond.json'

Note: I believe this code works as given on Windows, but I was not able to test it directly. Please let me know if you have any issues with the code.
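If you are curious, the same idea can also be written with pathlib from the standard library, which handles path separators on Windows, macOS, and Linux for you. This is only a sketch: the name wiki_json_path_alt and the extra base_dir argument are my own additions for illustration, and the rest of the tutorial keeps using wiki_json_path as defined above.

```python
import tempfile
from pathlib import Path

def wiki_json_path_alt(page_title, lang="en", base_dir=None):
    """Hypothetical pathlib variant of wiki_json_path, for illustration only."""
    # By default, mirror the original: the parent of the working directory.
    base = Path(base_dir) if base_dir is not None else Path.cwd().parent
    dir_path = base / "data" / lang
    # exist_ok=True makes this safe to call when the directory already exists.
    dir_path.mkdir(parents=True, exist_ok=True)
    return str(dir_path / (page_title.replace(" ", "_") + ".json"))

# Try it inside a temporary directory so no real folders are created.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = wiki_json_path_alt("University of Richmond", base_dir=tmp_dir)
    print(Path(path).name)
    print(Path(path).parent.name)
```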

Using the MediaWiki API

Now, let's define a function that returns the data for a page, either by loading a saved file from your computer or, if no such file exists yet, by pulling the data from the MediaWiki API.

We'll need this function a lot, and some of the details haven't been covered in the tutorials yet, so I'll provide the code directly here. Notice that I've split the code into two functions to improve readability.

In [16]:
def get_mediawiki_request(page_title, lang):
    """Returns URL to make parse request to the MediaWiki API
        
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string giving the complete request URL.
    """
    page_title = re.sub(" ", "_", page_title)
    base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
    default_query = 'action=parse&format=json&'

    url = base_api_url + "?" + default_query + "page=" + page_title
    return url

In [17]:
def get_wiki_json(page_title, lang='en'):
    """Returns JSON data as a dictionary for the Wikipedia page
    
    This function either loads a cached version of the page or,
    if a local version of the page is not available, calls the
    MediaWiki API directly.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A dictionary object with the complete parsed JSON data.
    """
    file_path = wiki_json_path(page_title, lang)
    
    # if the page is not cached locally, grab it from Wikipedia
    if not os.path.exists(file_path):
        print("Pulling data from MediaWiki API...")
        url = get_mediawiki_request(page_title, lang)
        r = requests.get(url)
        page_data = r.json()['parse']
        with open(file_path, 'w') as outfile:
            json.dump(page_data, outfile)
            
    # read the JSON data from local filesystem
    with open(file_path, 'r') as infile:
        new_data = json.load(infile)
    
    return new_data

Try running the code using the chunk below. The first time it's run it should print out a message; afterwards it should load the file directly from your computer without printing anything.

In [18]:
data = get_wiki_json("Water")
Pulling data from MediaWiki API...

Test the function further by also grabbing the following pages.

In [19]:
data = get_wiki_json("University of Virginia")
data = get_wiki_json("Virginia Commonwealth University")
data = get_wiki_json("Virginia Union University")
Pulling data from MediaWiki API...
Pulling data from MediaWiki API...
Pulling data from MediaWiki API...
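The load-or-fetch logic inside get_wiki_json can also be tried without touching the network. The sketch below isolates that pattern: the names load_or_fetch and fake_fetch are hypothetical, and fake_fetch simply returns a small dictionary in place of a real API call while counting how often it is used.

```python
import json
import os
import tempfile

def load_or_fetch(file_path, fetch):
    """Return cached JSON if file_path exists; otherwise call fetch() and cache it."""
    if not os.path.exists(file_path):
        print("Pulling data from MediaWiki API...")
        with open(file_path, "w") as outfile:
            json.dump(fetch(), outfile)
    with open(file_path, "r") as infile:
        return json.load(infile)

calls = []

def fake_fetch():
    # Stand-in for the real network request; records each call.
    calls.append(1)
    return {"title": "Water"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Water.json")
    first = load_or_fetch(path, fake_fetch)    # fetches and writes the file
    second = load_or_fetch(path, fake_fetch)   # reads the cached file

# One fetch in total, and both calls return the same data.
print(len(calls), first == second)
```

The message is printed only once because the second call finds the file already on disk.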

Open your 'data' directory and look in its 'en' subdirectory. Make sure that your data shows up correctly. Also make sure that re-running the code does not print out the message "Pulling data from MediaWiki API...".
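You can also do this check from Python instead of the file browser. Here is a minimal sketch, with a hypothetical helper named list_cached_pages, demonstrated on a temporary directory holding empty placeholder files rather than on your real data.

```python
import os
import tempfile

def list_cached_pages(lang_dir):
    """Return sorted page titles for the .json files in lang_dir (illustrative helper)."""
    names = [f for f in os.listdir(lang_dir) if f.endswith(".json")]
    # Drop the ".json" extension to recover the page titles.
    return sorted(name[:-len(".json")] for name in names)

# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for title in ["Water", "University_of_Virginia"]:
        open(os.path.join(tmp_dir, title + ".json"), "w").close()
    cached = list_cached_pages(tmp_dir)

print(cached)
```

To run it on your own files, you could pass os.path.dirname(wiki_json_path("Water")) as the directory.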