In this very short tutorial, I'll introduce the changes I've made to the
wiki.py file (you'll need to download it once again). Make sure that the
following code block runs without an error; it checks that you have an updated
version of the code.
import wiki

assert wiki.__version__ >= 3
Before continuing, please delete all of the Wikipedia json files that you have downloaded. Trust me, it will be okay.
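If you would rather clear them out with a script than by hand, something like the following works. This is just a sketch: it assumes the files live in the same data/en directory used in the listing below, and that the old downloads end in .json; adjust the path if yours live elsewhere.

```python
import os

# Assumed location of the downloaded Wikipedia files; change this if
# your data directory sits somewhere else.
data_dir = os.path.join(os.path.dirname(os.getcwd()), "data", "en")

removed = 0
if os.path.isdir(data_dir):
    for fname in os.listdir(data_dir):
        if fname.endswith(".json"):      # only the old, uncompressed files
            os.remove(os.path.join(data_dir, fname))
            removed += 1
print("Removed", removed, "files")
```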
Now, I've made two major changes to the functions in
wiki.py. First of all, let's
see what happens when we download the Wikipedia page for Richmond:
data = wiki.get_wiki_json('Richmond,_Virginia')
Now look at the saved data on your laptop:
import os

os.listdir(os.path.join(os.path.dirname(os.getcwd()), "data", "en"))[:10]
['Wang_Zhen_(Wang_Yiting).json.gz', 'Elizabeth_Montagu.json.gz', 'Mother_church.json.gz', 'Susan_Bulkeley_Butler.json.gz', 'Gertrude_Stein.json.gz', 'Alfred_Corn.json.gz', 'Scott_Turow.json.gz', 'Louis_Joseph_Vance.json.gz', 'Elliot_Paul.json.gz', 'Virginia_Center_for_Architecture.json.gz']
You should notice that the file is now saved as a
json.gz file. This is a
compressed version of the json file that can still be read directly by Python.
It is about 4 times smaller than the raw file and should help keep plenty of
free space on your machines as we work with larger datasets.
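There is nothing magical about the compressed format: Python's standard gzip module opens these files directly. Here is a small sketch of the round trip, writing and then reading a gzipped json file (the record itself is made up for illustration):

```python
import gzip
import json

# A made-up record, stored the same way the tutorial's pages are: gzipped JSON.
page = {"title": "Richmond, Virginia", "lang": "en"}
with gzip.open("example.json.gz", "wt", encoding="UTF-8") as f:
    json.dump(page, f)

# Reading it back needs nothing beyond gzip.open in text mode.
with gzip.open("example.json.gz", "rt", encoding="UTF-8") as f:
    data = json.load(f)

print(data["title"])
```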
As you noticed last class, it can take a while for everyone in class to download pages from Wikipedia. It's not so slow that you can't do it on your own time for specific projects, but it becomes a pain when we are just sitting around waiting for pages to download (the internet in our room also slows down, and we aren't being very nice to the MediaWiki API).
As an alternative, I've written the function
wiki.bulk_download. It allows you
to grab a zip file from my server that contains all of the Wikipedia pages for a
specific category. It should take care of all the details for you. Here is the
help page for the function:

Help on function bulk_download in module wiki:

bulk_download(name, lang='en', base_url='http://distantviewing.org/')
    Bulk download Wikipedia files

    Args:
        name: A character string describing the
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
        base_url: The URL path that contains the zip file.

    Returns:
        Number of files added to the archive.
Right now it points to the server where my Distant Viewing project sits, but I have kept open the possibility that the files may be put elsewhere in the future.
Now, let's test this code with the following, which downloads all of the pages for Richmond, Virginia (hopefully my server can keep up!).
Added 0 files from an archive of 1126 files.
It should say that it added 1125 files from an archive of 1126. What happened with the last file? You already had it from when you downloaded the page for 'Richmond, Virginia'.
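That skip-existing behavior is easy to picture: the function only copies files out of the archive when they are not already on disk. Here is a minimal sketch of the idea using only the standard library; this is my reconstruction for illustration, not the actual code inside wiki.py.

```python
import os
import zipfile

def extract_new_files(zip_path, out_dir):
    """Extract only archive members not already present in out_dir."""
    added = 0
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        for name in names:
            target = os.path.join(out_dir, os.path.basename(name))
            if not os.path.exists(target):   # skip files you already have
                with open(target, "wb") as f:
                    f.write(zf.read(name))
                added += 1
    print("Added {0:d} files from an archive of {1:d} files.".format(
        added, len(names)))
    return added
```

Because the check happens file by file, re-running the download on an archive you already have reports that 0 files were added.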
Here are two other bulk downloads that will be helpful to have:
Added 0 files from an archive of 202 files.

Added 0 files from an archive of 10134 files.
The first should add 202 files and the second adds a much larger set of 10128 (out of 10134). The overlapping 6 pages are authors that were linked to from the Richmond page. We will be using all three sets of data over the next few classes.
One drawback of the way I wrote the function
wiki.bulk_download is that it has no
way of knowing whether you've already downloaded a file. Therefore, it is a good idea
to comment out the bulk download commands. We do this, recall, by adding the #
character at the start of the bulk download lines. Do this above to the three
download commands, so that they look like this:
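Something like the following; the archive names here are placeholders, since the exact names you passed in the cells above are what belong on these lines.

```python
# The names below are placeholders: use whatever archive names you passed
# to wiki.bulk_download in the cells above.
# wiki.bulk_download('richmond-virginia')
# wiki.bulk_download('second-archive')
# wiki.bulk_download('third-archive')
```

With the lines commented out, the notebook still documents where the data came from, but re-running it will not re-fetch the zip files from the server.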