Today's notes walk you through building a function to save MediaWiki API requests on your machine.
Let's start by loading four modules that we will need for this tutorial. I will also grab the function join by name, as we will need to use it quite frequently.
import json
import os
import re

import requests

from os.path import join
Before we wrap anything up in fancy functions, let's replicate the API request that we made in the browser. Start by defining the lang (language code), the base API URL, and the default query string:
lang = 'en'
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
default_query = 'action=parse&format=json&'
Now let's define the page title for the University of Richmond (and convert spaces to underscores if needed) and then build the full URL string:
page_title = "University of Richmond"
page_title = re.sub(" ", "_", page_title)
url = base_api_url + "?" + default_query + "page=" + page_title
Pay attention to how I am using the + operator to build a larger string from individual variables. The variable url should now contain the same URL request that we ran in the browser.
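To see the + operator at work on its own, here is a minimal sketch that rebuilds the same URL from its pieces (same variable values as above):

```python
# build the request URL piece by piece with the + operator
lang = 'en'
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
default_query = 'action=parse&format=json&'
page_title = 'University_of_Richmond'
url = base_api_url + '?' + default_query + 'page=' + page_title
print(url)
# https://en.wikipedia.org/w/api.php?action=parse&format=json&page=University_of_Richmond
```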
Now we are ready to make a request to Wikipedia asking for the page-level data. The format here is exactly the same as requesting an HTML page. You should get a response code of '200', indicating a successful request.
r = requests.get(url)
r
Now, because the data is in JSON format, we need to use slightly different code to grab it. Here, I'm calling the json method of the request and selecting just the 'parse' element (remember that in your browser the parse element was at the top of the JSON tree).
page_data = r.json()['parse']
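If you want to see what the .json() call is doing without hitting the network, json.loads parses a JSON string into the same kind of nested structure. The payload below is a small stand-in for the real API response, not actual data from Wikipedia:

```python
import json

# a tiny stand-in for the body that requests would return
payload_text = '{"parse": {"title": "University of Richmond", "images": []}}'
payload = json.loads(payload_text)   # what r.json() does under the hood
page_data = payload['parse']         # select the top-level 'parse' element
print(page_data['title'])            # prints: University of Richmond
```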
If we print this out, you'll see that it contains the same information as you had in your browser (but probably not formatted as well):
The data we call JSON is represented in Python as an object type known as a dictionary (dict for short):
data = page_data
type(data)
We'll talk more about dictionaries next class, but I thought it would be useful for you to at least hear the term and see them in action today. You can access the 'elements' of the dictionary by using square brackets and the name of the element that you want to locate. For example, here are all of the images on our page:
data['images']

['University_of_Richmond_seal.svg',
 'UR_Shield.svg',
 'Richmond-Rummell_View.jpg',
 'BoatwrightTower.jpg',
 'CommonsLake.jpg',
 'Commons-logo.svg',
 'Folder_Hexagonal_Icon.svg']
And here are all of the external links on the UR Wikipedia page:
data['externallinks']

['https://www.richmond.edu/about/motto.html',
 'https://www.nacubo.org/-/media/Nacubo/Documents/EndowmentFiles/2017-Endowment-Market-Values.ashx?la=en&hash=E71088CDC05C76FCA30072DA109F91BBC10B0290',
 'http://ifx.richmond.edu/pdfs/factbook2015-16.pdf',
 'http://ifx.richmond.edu/pdfs/FB2-Enrollment.pdf',
 'https://web.archive.org/web/20150709201808/http://communications.richmond.edu/marks/colors.html',
 'http://communications.richmond.edu/marks/colors.html',
 'http://news.richmond.edu/features/article/-/10822/did-you-know-the-ur-spider-a-bite-of-history.html',
 'https://web.archive.org/web/20120927033702/http://www.dhr.virginia.gov/registers/Cities/Richmond/127-0045_Columbia_1982_Final_Nomination.pdf',
 'http://www.dhr.virginia.gov/registers/Cities/Richmond/127-0045_Columbia_1982_Final_Nomination.pdf',
 'http://www.lva.virginia.gov/public/guides/opac/gilletteabout.htm']
Next time we will see how to work with, save, and compute over these various elements.
First, let's just save our UR data as a JSON file under the tutorials directory on your laptop. The syntax below opens a file for writing named "ur.json" and dumps the variable data into it.
with open('ur.json', 'w') as outfile:
    json.dump(data, outfile)
Go back to the file browser in Python and check that the file was created. Click on it and verify that the JSON data is available.
Now, let's load the file back into Python. This time we open the file for reading and save the result as the object new_data:
with open('ur.json', 'r') as infile:
    new_data = json.load(infile)
When you print out the object new_data, it should contain the same information as the original data.
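You can check the dump-and-load round trip with a small stand-in dictionary (a sketch; the dictionary contents and file name here are made up for illustration):

```python
import json

data = {'title': 'University_of_Richmond', 'images': ['UR_Shield.svg']}

# write the dictionary out as JSON ...
with open('roundtrip_test.json', 'w') as outfile:
    json.dump(data, outfile)

# ... and read it straight back in
with open('roundtrip_test.json', 'r') as infile:
    new_data = json.load(infile)

print(new_data == data)   # prints: True
```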
Now, we need a function that returns the correct path at which a JSON file should be stored, based on its title and the Wikipedia language. This is a bit tricky, particularly if you want it to run correctly across various operating systems, so I'll just give you the correct code here. But, you should be able to understand most of the parts and what is going on.
def wiki_json_path(page_title, lang='en'):
    """Returns local path to JSON file for Wikipedia page data

    This function is used to determine where the dump of a call to
    the MediaWiki API, using the parse method, should be stored. As
    an extra action, the function also checks that the relevant
    directory exists and creates it if it does not.

    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.

    Returns:
        A string describing a relative path to file.
    """
    page_title = re.sub(" ", "_", page_title)
    stat289_base_dir = os.path.dirname(os.getcwd())
    dir_name = join(stat289_base_dir, "data", lang)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    file_name = page_title + ".json"
    file_path = join(dir_name, file_name)
    return file_path
We can test out that the function works as expected here:
wiki_json_path("University of Richmond")
Note: I believe this code works as given on Windows, but I was not able to test it directly. Please let me know if you have any issues with the code.
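The reason the function leans on join rather than gluing path pieces together with '/' is that join inserts the separator appropriate for whatever operating system the code runs on. A quick sketch:

```python
import os
from os.path import join

# join picks '/' on macOS/Linux and '\\' on Windows automatically
path = join('data', 'en', 'Water.json')
print(path)
print(path == os.sep.join(['data', 'en', 'Water.json']))   # prints: True
```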
Now, let's define a function that grabs the data from the MediaWiki API and either loads the file from your computer or pulls it from Wikipedia.
We'll need this function a lot, and some of the details haven't been covered in the tutorials yet, so I'll provide the code directly here. Notice that I've split the code into two parts to improve readability.
def get_mediawiki_request(page_title, lang):
    """Returns URL to make parse request to the MediaWiki API

    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.

    Returns:
        A string giving the complete request URL.
    """
    page_title = re.sub(" ", "_", page_title)
    base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
    default_query = 'action=parse&format=json&'
    url = base_api_url + "?" + default_query + "page=" + page_title
    return url
def get_wiki_json(page_title, lang='en'):
    """Returns JSON data as a dictionary for the Wikipedia page

    This function either loads a cached version of the page or,
    if a local version of the page is not available, calls the
    MediaWiki API directly.

    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.

    Returns:
        A dictionary object with the complete parsed JSON data.
    """
    file_path = wiki_json_path(page_title, lang)
    # if page does not exist, grab it from Wikipedia
    if not os.path.exists(file_path):
        print("Pulling data from MediaWiki API...")
        url = get_mediawiki_request(page_title, lang)
        r = requests.get(url)
        page_data = r.json()['parse']
        with open(file_path, 'w') as outfile:
            json.dump(page_data, outfile)
    # read the JSON data from local filesystem
    with open(file_path, 'r') as infile:
        new_data = json.load(infile)
    return new_data
Try to run the code using the chunk below. The first time it's run it should print out a message; afterwards it should load the file directly.
data = get_wiki_json("Water")
Pulling data from MediaWiki API...
Test the function further by also grabbing the following pages.
data = get_wiki_json("University of Virginia")
data = get_wiki_json("Virginia Commonwealth University")
data = get_wiki_json("Virginia Union University")
Pulling data from MediaWiki API...
Pulling data from MediaWiki API...
Pulling data from MediaWiki API...
Open your 'data' directory and look in the directory for 'en'. Make sure that your data shows up correctly. Also make sure that re-running the code does not print out the message "Pulling data from MediaWiki API...".
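The compute-once, load-afterwards pattern inside get_wiki_json is worth seeing in isolation, without any network calls. Here is a stripped-down sketch of the same idea; the helper name cached_json and the file name are made up for illustration:

```python
import json
import os

def cached_json(file_path, compute):
    """Return JSON data from file_path, computing and saving it first
    if the file does not yet exist (same pattern as get_wiki_json)."""
    if not os.path.exists(file_path):
        print("Computing fresh data...")
        with open(file_path, 'w') as outfile:
            json.dump(compute(), outfile)
    with open(file_path, 'r') as infile:
        return json.load(infile)

# first call computes and saves; second call reads the cached file,
# so the second lambda is never actually run
first = cached_json('cache_demo.json', lambda: {'value': 42})
second = cached_json('cache_demo.json', lambda: {'value': 99})
print(first == second)   # prints: True
```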