Tutorial 12: Networks and Algorithms

In this very short tutorial, I'll introduce the changes I've made to the wiki.py file (you'll need to download it once again). Make sure that the following code block runs without an error; it checks that you have an updated version of the code.

In [1]:
import wiki

assert wiki.__version__ >= 3

Before continuing, please delete all of the Wikipedia json files that you have downloaded. Trust me, it will be okay.

Now, I've made two major changes to the functions in wiki.py. First of all, let's see what happens when we download the Wikipedia page for Richmond:

In [2]:
data = wiki.get_wiki_json('Richmond,_Virginia')

Now look at the saved data on your laptop:

In [3]:
import os

os.listdir(os.path.join(os.path.dirname(os.getcwd()), "data", "en"))[:10]

You should notice that the file is now saved as a json.gz file. This is a gzip-compressed version of the json file that Python can still read. It is about 4 times smaller than the raw file, which should leave plenty of free space on your machines as we use larger datasets.
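If you are curious how a json.gz file can be read from Python, here is a minimal sketch using only the standard library. It is not the code inside wiki.py; the `page` dictionary and the temporary file path are made up for illustration, and nothing in your data folder is touched.

```python
import gzip
import json
import os
import tempfile

# Hypothetical page data standing in for a real Wikipedia response.
page = {"title": "Richmond, Virginia", "links": ["Virginia", "James River"]}

# Write to a temporary directory so your real data folder is left alone.
path = os.path.join(tempfile.mkdtemp(), "Richmond,_Virginia.json.gz")

# Mode "wt" accepts text and compresses it on the way to disk.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(page, f)

# Mode "rt" decompresses while reading, so json.load sees ordinary text.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["title"])
```

The only difference from working with a plain json file is that `gzip.open` replaces the built-in `open`.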

Bulk downloads

As you noticed last class, it can take a while for everyone in class to download pages from Wikipedia. It's not so slow that you can't do it on your own time for specific projects, but it becomes a pain when we are just sitting around waiting for pages to download (the internet in our room also slows down, and we aren't being very nice to the MediaWiki API).

As an alternative, I've written the function wiki.bulk_download. It allows you to grab a zip file from my server that contains all of the Wikipedia pages for a specific category. It should take care of all the details for you. Here is the help page:

In [4]:
help(wiki.bulk_download)

Help on function bulk_download in module wiki:

bulk_download(name, lang='en', base_url='http://distantviewing.org/')
    Bulk download Wikipedia files
        name: A character string describing the
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
        base_url: The URL path that contains the zip file.
        Number of files added to the archive.

Right now it points to the server where my Distant Viewing project sits, but I kept open the possibility that the files may be put elsewhere in the future.

Now, let's test this code with the following, which downloads all of the pages from Richmond, Virginia (hopefully my server can keep up!).

In [5]:
wiki.bulk_download('richmond-va')

Added 0 files from an archive of 1126 files.

It should say that it added 1125 files from an archive of 1126. What happened with the last file? You already had it from when you downloaded the page for 'Richmond, Virginia'.
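To see where the "added" and "archive" counts come from, here is a small sketch of one way to unpack only the files you don't already have. It is an assumption about the general approach, not the actual code in wiki.py: the helper `add_missing_files` and the file names are hypothetical, and the archive is built in memory rather than downloaded.

```python
import io
import os
import tempfile
import zipfile


def add_missing_files(archive, target_dir):
    """Extract only the archive members that are not already in target_dir."""
    names = archive.namelist()
    added = 0
    for name in names:
        if not os.path.exists(os.path.join(target_dir, name)):
            archive.extract(name, target_dir)
            added += 1
    return added, len(names)


# Build a tiny in-memory archive with three made-up pages.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in ["A.json.gz", "B.json.gz", "C.json.gz"]:
        zf.writestr(name, b"placeholder")

target = tempfile.mkdtemp()

# Pretend one page was already downloaded individually.
with open(os.path.join(target, "A.json.gz"), "wb") as f:
    f.write(b"placeholder")

with zipfile.ZipFile(buffer) as zf:
    first = add_missing_files(zf, target)   # two files are new
    second = add_missing_files(zf, target)  # nothing is new the second time

print("Added {} files from an archive of {} files.".format(*first))
print("Added {} files from an archive of {} files.".format(*second))
```

The first call adds 2 of the 3 files because one was already there. Running it a second time adds 0, which is the same pattern you see if you rerun a bulk download after the files are already on your laptop.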

Here are two other bulk downloads that will be helpful to have:

In [6]:
Added 0 files from an archive of 202 files.
Added 0 files from an archive of 10134 files.

The first should add 202 files and the second should add a much larger set of 10128 (out of 10134). The 6 overlapping pages are authors that were linked to from the Richmond page. We will be using all three sets of data over the next few classes.

Final task

One drawback of the way I wrote the function wiki.bulk_download is that it has no way of knowing whether you've already downloaded a file. Therefore, it is a good idea to comment out the bulk download commands. We do this, recall, by adding the # symbol at the start of the bulk download lines. Do this above to the three download commands, so that they look like this:

In [7]:
# wiki.bulk_download('richmond-va')