Tutorial 21: Customizing the Wiki Text Explorer

The third and fourth projects require you to build a corpus of pages and apply the Wiki Text Explorer code to this corpus. Today I will show you a few other details for putting together your projects. Start by loading the wiki, iplot, and wikitext modules:

In [1]:
import wiki
import iplot
import wikitext
Loading BokehJS ...
In [2]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3

The new wikitext module includes a function, get_internal_links, that takes the name of a Wikipedia page and returns a dictionary with three elements.

In [3]:
links_us = wikitext.get_internal_links('History_of_the_United_States')
links_us.keys()
Out[3]:
dict_keys(['ilinks', 'ilinks_p', 'ilinks_li'])

The element 'ilinks' includes all links given on the page. The element 'ilinks_p' contains all links found within paragraph tags, and 'ilinks_li' contains all links found inside list items (bullet points). You can use any of these to build a corpus of interest for your projects.
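
If you want a corpus drawn from more than one of these elements, one option is to merge them and drop repeated page names. The sketch below assumes each element is a list of page-name strings; the helper merge_links and the example values are made up for illustration and are not part of the wikitext module.

```python
def merge_links(*link_lists):
    """Combine several lists of page names, dropping duplicates
    while keeping the order in which names first appear."""
    seen = set()
    merged = []
    for links in link_lists:
        for name in links:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# Stand-in for a dictionary like the one returned by get_internal_links.
# These values are invented; real output will be much longer.
example_links = {
    "ilinks_p": ["American_Revolution", "Civil_War", "American_Revolution"],
    "ilinks_li": ["Civil_War", "New_Deal"],
}

corpus_pages = merge_links(example_links["ilinks_p"], example_links["ilinks_li"])
print(corpus_pages)
```

With real data you would pass, for example, links_us['ilinks_p'] and links_us['ilinks_li'] to the helper and hand the result to WikiCorpus.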

Customizing topic and cluster names

Now, let's say that you've created a WikiCorpus object, such as this one:

In [4]:
wcorp = wikitext.WikiCorpus(links_us['ilinks_p'])

The wcorp object provides a method, json_meta_template, that produces a dictionary with template names for all of the topics and clusters in our corpus. Here we will save that dictionary to a file named 'history-us.json':

In [5]:
import json

with open('history-us.json', 'w') as fout:
    json.dump(wcorp.json_meta_template(), fout, indent=2)

Now, open the file in Jupyter notebook (or your favorite text editor). Change the name of the first topic, save the file, and then create the Text Explorer output:

In [6]:
wikitext.wiki_text_explorer(wcorp, input_file="history-us.json")

You should see that the page now names the first topic whatever you renamed it to. A large part of the next project involves constructing names for all of your topics.
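
Before renaming many topics, it can help to see how the template file is laid out. The sketch below prints the nesting of any JSON object without assuming particular key names. The summarize helper is hypothetical, and the sample text (including the keys "topics", "clusters", and "name") is an invented stand-in, not the actual layout produced by json_meta_template.

```python
import json


def summarize(obj, depth=0, max_depth=2):
    """Return a list of lines describing the nesting of a JSON-like object."""
    lines = []
    indent = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; the rest usually share its shape.
        lines.append(f"{indent}list of {len(obj)}, first element: {type(obj[0]).__name__}")
        if depth < max_depth:
            lines.extend(summarize(obj[0], depth + 1, max_depth))
    return lines


# Invented stand-in for the file contents. To inspect the real file, use:
#     with open('history-us.json') as fin:
#         template = json.load(fin)
sample_text = '{"topics": [{"name": "Topic 1"}], "clusters": [{"name": "Cluster 1"}]}'
template = json.loads(sample_text)

structure = summarize(template)
print("\n".join(structure))
```

Once you know where the names live in your own file, you can edit them by hand as above or change them in the loaded dictionary and write it back with json.dump.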