Tutorial 20: Wiki Text Explorer

It's good to understand all of the ideas in Tutorial 19. However, I have already wrapped up all of the functionality shown there in a helpful new refactoring of the wikitext module. That is, you'll get to focus on the text analysis itself and not just the code (yay!).

Make sure that you download a new version of wiki.py, iplot.py and wikitext.py and load all of them into Python:

In [1]:
import wiki
import iplot
import wikitext
Loading BokehJS ...
In [2]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3
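A bare assert fails without telling you which file is out of date. As a minimal sketch (the helper name check_version and the stand-in module object are my own assumptions, not part of the course modules), a small function can report the version it found and the one it needs:

```python
from types import SimpleNamespace


def check_version(module, minimum):
    """Return True if module.__version__ is at least `minimum`; otherwise raise."""
    found = getattr(module, "__version__", None)
    if found is None or found < minimum:
        name = getattr(module, "__name__", repr(module))
        raise RuntimeError(
            f"{name} is version {found}, but version {minimum} or later is "
            "required; download the latest copy and restart the kernel."
        )
    return True


# Hypothetical stand-in so the sketch runs without the course modules.
fake_wiki = SimpleNamespace(__name__="wiki", __version__=6)
print(check_version(fake_wiki, 6))
```

In the notebook you would pass the real modules instead, for example check_version(wiki, 6).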

Get some data

For today, make sure that you download the following data. Uncomment the line, run it, and then re-comment it so that it only runs once.

In [3]:
#wiki.bulk_download('history-us', force=True)
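If you would rather not comment and uncomment by hand, one alternative is a marker file that records that the download already happened. This is only a sketch under my own assumptions: the helper run_once and the marker file name are hypothetical, and a list append stands in for the real download so the example runs anywhere.

```python
import os
import tempfile


def run_once(marker_path, action):
    """Run `action` only if `marker_path` does not exist yet, then create it."""
    if os.path.exists(marker_path):
        return False  # already done on an earlier run; skip
    action()
    with open(marker_path, "w") as fh:
        fh.write("done\n")
    return True


calls = []
# Hypothetical marker location; a temporary directory keeps the sketch tidy.
marker = os.path.join(tempfile.mkdtemp(), "history-us.done")

# In the tutorial the action would instead be:
#   lambda: wiki.bulk_download('history-us', force=True)
first = run_once(marker, lambda: calls.append("download"))
second = run_once(marker, lambda: calls.append("download"))
print(first, second, len(calls))
```

The second call is skipped because the marker file already exists, so the action runs exactly once.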

The WikiCorpus class

I have wrapped up all of the text analysis tools we should need for the next few weeks in a single class called WikiCorpus. Let's grab a set of links to start with:

In [4]:
links = wikitext.get_internal_links('History_of_the_United_States')['ilinks_p']

Now create a WikiCorpus object as follows (it's running a lot of code; it may take a minute or two to run):

In [5]:
wcorp = wikitext.WikiCorpus(links)
print(wcorp)
WikiCorpus object with '739' documents and lexicon with '23243' terms.

You should see that it has pulled several hundred documents and many thousands of terms (739 documents and 23243 terms in the output above; your exact counts may differ). The object has methods and attributes that provide text analysis tools such as document similarity and LDA topics. An easier way to see all of this is to simply create a Wiki Text Explorer webpage like this:

In [6]:
wikitext.wiki_text_explorer(wcorp)

Now, you should see a new directory called 'text-explore' in your tutorial folder. Open the index page in your browser and play around with the tabs at the top. Notice that a lot of things are clickable links to other parts of the page. Spend some real time understanding what is going on with each of the components in the Explorer.
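To build some intuition for the document similarity scores, here is a minimal sketch of one common approach: cosine similarity between word-count vectors. This is an illustration only, not necessarily how WikiCorpus computes similarity, and the three short "documents" are toy text I made up rather than real Wikipedia pages.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words that appear in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration.
docs = {
    "Civil War": "union army fought confederate army",
    "Reconstruction": "union rebuilt after the war",
    "Gold Rush": "miners searched for gold in california",
}

for title in ("Reconstruction", "Gold Rush"):
    score = cosine_similarity(docs["Civil War"], docs[title])
    print(f"Civil War vs {title}: {score:.2f}")
```

Documents that share words score above zero and documents with no words in common score exactly zero. A real pipeline would usually weight the counts (for example with TF-IDF) and drop very common words first.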

Modify the WikiCorpus object

There are many options that you can set that change how the WikiCorpus object is constructed. The two most powerful ones set the number of topics and the number of clusters created by the module. Here I will set both to 20 instead of the default 40:

In [7]:
wcorp = wikitext.WikiCorpus(links, num_topics=20, num_clusters=20)

Now, re-create the page and see how the results change:

In [8]:
wikitext.wiki_text_explorer(wcorp)