It's good to understand all of the ideas in Tutorial 19. However, I have
already wrapped up all of the functionality shown there in a helpful new
refactoring of the wikitext module. That is, you'll get to focus on
the text analysis itself and not just the code (yay!).
Make sure that you download new versions of the course modules and load
all of them into Python:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3
For today, make sure that you download the following data. Uncomment the line, run it, and then recomment it so that it only runs once.
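The uncomment/run/recomment dance exists only so that the download runs once. An alternative pattern, sketched here with a stand-in fetch function (the real download call is course-specific), is to guard on whether the file already exists:

```python
import os

def run_once(path, fetch):
    # Only fetch the file if it is not already on disk.
    if not os.path.exists(path):
        fetch(path)

# Stand-in fetch: writes a placeholder instead of doing a real download.
run_once("example-data.txt", lambda p: open(p, "w").write("placeholder"))
```

With this guard in place, re-running the cell is harmless.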
I have wrapped up all of the text analysis tools we should need for the next
few weeks in a single class called
WikiCorpus. Let's grab a set of links to pages related to the history of the United States:
links = wikitext.get_internal_links('History_of_the_United_States')['ilinks_p']
Now create a
WikiCorpus object as follows (it's running a lot of code;
it may take a minute or two to run):
wcorp = wikitext.WikiCorpus(links)
print(wcorp)
WikiCorpus object with '739' documents and lexicon with '23243' terms.
You should see that it has pulled 739 documents and a lexicon of over 23,000 terms. The object has methods and attributes that provide text analysis tools such as document similarity and LDA topics. An easier way to see all of this is to simply create a Wiki Text Explorer webpage like this:
Now, you should see a new directory called 'text-explore' in your tutorial folder. Open the index page in your browser and play around with the tabs at the top. Notice that a lot of things are clickable links to other parts of the page. Spend some real time understanding what is going on with each of the components in the Explorer.
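To build some intuition for the document similarity scores in the Explorer: in most text-analysis pipelines these are cosine similarities between term-count vectors. Here is a minimal self-contained sketch of that idea (not the wikitext module's actual implementation, which works over the full lexicon):

```python
from collections import Counter
from math import sqrt

def cosine_sim(doc_a, doc_b):
    """Cosine similarity between two texts using raw term counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_sim("the war ended in 1945", "the war began in 1939"))
```

Documents sharing many terms score near 1; documents with no overlap score 0.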
There are many options that you can set that change how the WikiCorpus object is constructed. The two most powerful change the number of topics and clusters created by the module. Here I will make the number of topics and clusters 20 instead of the default 40:
wcorp = wikitext.WikiCorpus(links, num_topics=20, num_clusters=20)
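To get a feel for what a knob like num_clusters controls, here is a toy, self-contained k-means on one-dimensional points (not the module's actual clustering code, which operates on high-dimensional term vectors): with a larger k, the same data is split into finer groups.

```python
def kmeans_1d(points, k, iters=20):
    """Toy 1-D k-means with deterministic initialization."""
    centers = points[:k]  # start from the first k points
    for _ in range(iters):
        groups = {c: [] for c in centers}
        # Assign each point to its nearest center.
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        # Move each center to the mean of its group.
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

pts = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 9.0, 9.2]
print(kmeans_1d(pts, 2))  # coarse grouping
print(kmeans_1d(pts, 3))  # finer grouping
```

The same trade-off applies to num_topics: fewer topics merge related themes together, while more topics split them apart.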
Now, re-create the page and see how the results change: