## Tutorial 18: Creating the wikitext module

Recall that I created the module wiki.py to wrap up all of the functions for interacting with the MediaWiki API and make them easy to use. I then created the module iplot.py for working with interactive data visualizations. We need a similar module for working with textual data from a corpus of Wikipedia pages. This time, however, you are going to try to create the module yourself.

### Create wikitext.py

Start by constructing an empty module named wikitext.py. We will turn on the autoreload function and import the empty module.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import wikitext


Now, add this line to your module to indicate that this is version one of the code:

In [3]:
__version__ = 1


Save the file, and make sure that everything is working correctly (and autoreloading) by checking the version number:

In [4]:
wikitext.__version__

Out[4]:
1

### Cleaning text

To start our module, let's create a function that gets rid of the newline characters and numeric references (numbers in square brackets). For example, here is a short snippet of text from the 'Plato' page:

In [5]:
text = """Western religion and spirituality.[6]\n"""
text

Out[5]:
'Western religion and spirituality.[6]\n'

Write a function named clean_text in the wikitext.py module that takes one string argument and returns a cleaned string with the newlines and references removed. You can informally test using this code:

In [6]:
wikitext.clean_text(text)

Out[6]:
'Western religion and spirituality.'

When you think you have the correct code, test your function by running the following code lines:

In [7]:
assert wikitext.clean_text(text) == 'Western religion and spirituality.'
assert wikitext.clean_text('Some[120] more.') == 'Some more.'
assert wikitext.clean_text('And\n again.') == 'And again.'


If your code is working as expected, the block of code will produce no output. Something appears (an AssertionError) only if one of the checks fails.
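
If you get stuck, here is one possible sketch of clean_text, not the only correct answer. It assumes that a "numeric reference" is always one or more digits inside square brackets, and it uses the standard-library re module for the pattern matching.

```python
import re


def clean_text(text):
    """Return text with newlines and numeric references removed."""
    # Drop newline characters outright; nothing is substituted for
    # them, which is what the third test above expects.
    text = text.replace("\n", "")
    # Remove bracketed numbers such as [6] or [120]. Letter notes
    # like [a] are left alone because the pattern only matches digits.
    return re.sub(r"\[\d+\]", "", text)


print(clean_text("Western religion and spirituality.[6]\n"))
```

Because the pattern only matches digits, footnote markers such as [a] and [b] stay in the text, which agrees with the 'Plato' output shown in the next section.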

### List of paragraphs

Next, we'll create a function link_to_p that takes the name of a Wikipedia page as an input and returns a list of the paragraphs in the text. That is, each element of the list is a string containing the text of a paragraph. You can test the function with this code (it shows the first three paragraphs of the 'Plato' page):

In [8]:
wikitext.link_to_p('Plato')[:3]

Out[8]:
["Plato (/ˈpleɪtoʊ/;[a] Greek: Πλάτων[a] Plátōn, pronounced\xa0[plá.tɔːn] in Classical Attic; 428/427 or 424/423[b] – 348/347 BC) was a philosopher in Classical Greece and the founder of the Academy in Athens, the first institution of higher learning in the Western world. He is widely considered the pivotal figure in the development of Western philosophy. Unlike nearly all of his philosophical contemporaries, Plato's entire work is believed to have survived intact for over 2,400 years.",
'Along with his teacher, Socrates, and his most famous student, Aristotle, Plato laid the foundations of Western philosophy and science. Alfred North Whitehead once noted: "the safest general characterization of the European philosophical tradition is that it consists of a series of footnotes to Plato." In addition to being a foundational figure for Western science, philosophy, and mathematics, Plato has also often been cited as one of the founders of Western religion and spirituality.',
"Plato was the innovator of the written dialogue and dialectic forms in philosophy. Plato appears to have been the founder of Western political philosophy, with his Republic, and Laws among other dialogues, providing some of the earliest extant treatments of political questions from a philosophical perspective. Plato's own most decisive philosophical influences are usually thought to have been Socrates, Parmenides, Heraclitus and Pythagoras, although few of his predecessors' works remain extant and much of what we know about these figures today derives from Plato himself."]

Make sure that your function does these two things:

• calls the clean_text function on each block of text
• does not return paragraphs that are empty after cleaning

Once you have that worked out, check the code with the test below.

In [9]:
paragraphs = wikitext.link_to_p('Plato')[:3]

assert paragraphs[1][:22] == 'Along with his teacher'
assert paragraphs[1][-13:] == 'spirituality.'


Again, the code works if the above does not produce any output.
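
The two requirements listed above (clean every block of text, drop paragraphs that end up empty) can be sketched on a small, made-up piece of XML. This is only an illustration under stated assumptions: paragraphs_from_xml and SAMPLE are hypothetical names, the page is assumed to be available as a well-formed XML string, and the step that turns a page name into that string (which depends on wiki.py) is left out.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(text):
    """Return text with newlines and numeric references removed."""
    return re.sub(r"\[\d+\]", "", text.replace("\n", ""))


def paragraphs_from_xml(page_xml):
    """Return the cleaned, non-empty paragraphs found in an XML string."""
    root = ET.fromstring(page_xml)
    paragraphs = []
    for p_node in root.iter("p"):
        # itertext() also collects the text inside nested tags (links).
        text = clean_text("".join(p_node.itertext()))
        # Skip paragraphs that contain nothing after cleaning.
        if text.strip():
            paragraphs.append(text)
    return paragraphs


# Hypothetical stand-in for the XML of a real page.
SAMPLE = """<div>
<p>First <a href="/wiki/Athens">paragraph</a>.[1]
</p>
<p>
</p>
<p>Second paragraph.</p>
</div>"""

print(paragraphs_from_xml(SAMPLE))
```

The middle paragraph of the sample contains only a newline, so it disappears from the result.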

### Entire document

While it is often useful to have the text of each paragraph separated, more often we will want to extract the entire text as a whole. Write a new function link_to_doc that returns the text of all of the paragraphs as a single string. Hint: The easiest way to do this is to call the function link_to_p and then collapse the results using the string join method.
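
As a reminder of how join works, here is a minimal sketch on a hand-written list of paragraphs. The helper name paragraphs_to_doc is hypothetical, and joining with a single space is an assumption based on the sample output for 'Plato', where consecutive paragraphs are separated by one space.

```python
def paragraphs_to_doc(paragraphs):
    """Return a list of paragraph strings collapsed into one string."""
    # The string before .join() is placed between consecutive elements.
    return " ".join(paragraphs)


example = ["Plato was a philosopher.", "He founded the Academy."]
print(paragraphs_to_doc(example))
```

In your module, link_to_doc would pass the result of link_to_p to a join like this one.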

First, try your code with this:

In [10]:
wikitext.link_to_doc('Plato')[:1000]

Out[10]:
'Plato (/ˈpleɪtoʊ/;[a] Greek: Πλάτων[a] Plátōn, pronounced\xa0[plá.tɔːn] in Classical Attic; 428/427 or 424/423[b] – 348/347 BC) was a philosopher in Classical Greece and the founder of the Academy in Athens, the first institution of higher learning in the Western world. He is widely considered the pivotal figure in the development of Western philosophy. Unlike nearly all of his philosophical contemporaries, Plato\'s entire work is believed to have survived intact for over 2,400 years. Along with his teacher, Socrates, and his most famous student, Aristotle, Plato laid the foundations of Western philosophy and science. Alfred North Whitehead once noted: "the safest general characterization of the European philosophical tradition is that it consists of a series of footnotes to Plato." In addition to being a foundational figure for Western science, philosophy, and mathematics, Plato has also often been cited as one of the founders of Western religion and spirituality. Plato was the innovator '

Then run these tests once you think you are finished with the code.

In [11]:
doc = wikitext.link_to_doc('Plato')

assert type(doc) == str
assert len(doc) == 64299
assert doc[:5] == 'Plato'


### Docstrings

Go back to the module and make sure that you have full docstrings on all of the functions in the module. Each one should give a sentence describing what the function does, followed by a description of the input argument and then a description of what is returned.
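
As an illustration of that layout, here is a docstring on a small made-up function (count_words is hypothetical and not part of wikitext.py). The "Args" and "Returns" labels are just one common convention; any consistent layout that covers the same three pieces is fine.

```python
def count_words(text):
    """Count the whitespace-separated words in a string.

    Args:
        text: The string to examine.

    Returns:
        The number of words in the string, as an integer.
    """
    return len(text.split())


# The docstring is stored on the function and shown by help().
print(count_words.__doc__)
```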

Once you have the three functions written, check your code with the pycodestyle and pylint modules.

In [12]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

Out[12]:
0
In [13]:
from pylint.epylint import lint
lint("wikitext.py")

 --------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Out[13]:
0

Try to fix all of the issues given by these modules. I suggest fixing the code style issues first, followed by the pylint warnings. If a warning does not make sense to you, just ask!

### More practice

Let's write another function! We will write a function named link_to_plinks that takes a Wikipedia link and returns a list of all the internal links on the page that are given somewhere inside of a paragraph tag. This will avoid, for example, extraneous links at the bottom and sides of the page. However, we want to ensure a few things about the results:

• only return the page name (i.e., not the '/wiki/' part)
• only return internal links and 'real' pages; use the list of internal links for this
• return a list with no duplicates
• sort the list in the output, and make sure the result is a 'list'

You'll probably want to work on this function in stages. That is, return all of the links at first and then build up if statements to filter out exactly what we want. Note that you'll need to replace spaces with underscores in the list of links provided from the Wikipedia JSON file.
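
The filtering stages described above can be sketched on a small, made-up example. Everything here is an assumption for illustration: plinks_from_xml, SAMPLE and TITLES are hypothetical names, the page is assumed to be a well-formed XML string, and the list of internal links is assumed to be a plain list of page titles containing spaces. How you get both from a page name depends on wiki.py.

```python
import xml.etree.ElementTree as ET


def plinks_from_xml(page_xml, internal_titles):
    """Return a sorted list of internal page names linked inside <p> tags."""
    # Titles use spaces, hrefs use underscores, so convert once up front.
    valid = {title.replace(" ", "_") for title in internal_titles}
    found = set()  # a set removes duplicates for us
    root = ET.fromstring(page_xml)
    for p_node in root.iter("p"):
        for a_node in p_node.iter("a"):
            href = a_node.get("href", "")
            if not href.startswith("/wiki/"):
                continue
            name = href[len("/wiki/"):]  # strip the '/wiki/' part
            if name in valid:
                found.add(name)
    return sorted(found)  # sorted() always returns a list


# Hypothetical stand-ins for a real page and its internal links.
SAMPLE = """<div>
<p><a href="/wiki/Socrates">Socrates</a> taught in
<a href="/wiki/Ancient_Athens">Athens</a>; see
<a href="/wiki/Socrates">Socrates</a>,
<a href="/wiki/Help:IPA">IPA</a> and
<a href="https://example.org">an external site</a>.</p>
<div><a href="/wiki/Aristotle">Aristotle</a></div>
</div>"""
TITLES = ["Socrates", "Ancient Athens", "Aristotle"]

print(plinks_from_xml(SAMPLE, TITLES))
```

In the sample, the link to Aristotle is dropped because it sits outside a paragraph tag, and the Help page is dropped because it is not in the list of internal links.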

You can test your code with:

In [14]:
wikitext.link_to_plinks('Plato')[:25]

Out[14]:
['Abstraction',
'Achaemenid_Empire',
'Aegina',
'Afterlife',
'Al-Farabi',
'Albert_Einstein',
'Alexander_of_Aphrodisias',
'Alfred_Tarski',
'Allegory_of_the_Cave',
'Alonzo_Church',
'Ambrose',
'Anamnesis_(philosophy)',
'Ancient_Athens',
'Ancient_Greek_philosophy',
'Anytus',
'Apollo',
'Apology_(Plato)',
'Aporia',
'Applied_mathematics',
'Apuleius',
'Archetype']

You should find that the following code will run if you have correctly defined the link_to_plinks function.

In [15]:
ilinks = wikitext.link_to_plinks('Plato')



Finally, ensure that you have a docstring for the function and that the pycodestyle and pylint modules produce no errors.

In [16]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

Out[16]:
0
In [17]:
from pylint.epylint import lint
lint("wikitext.py")

 --------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Out[17]:
0

### Even more practice

The above steps should take some time to get done correctly. If you would like even more practice with building and testing functions for working with XML and textual data, here is one more task. Build a function link_to_geo that takes the name of a Wikipedia page and returns either the latitude and longitude associated with the page or, if there is no geographic information, the object None.

You can test your code with the 'London' page:

In [18]:
wikitext.link_to_geo('London')

Out[18]:
(51.50722, -0.1275)

At first, try to just spit out the coordinates as a string as given by Wikipedia. Then, make sure that you correctly return None when given a page like Plato:

In [19]:
type(wikitext.link_to_geo('Plato'))

Out[19]:
NoneType

Finally, when there is coordinate information, split the string into latitude and longitude and return the result as a tuple (just use return lat, lon in the code). To test, check that we have:

In [20]:
lat, lon = wikitext.link_to_geo('London')

assert type(lat) == float
assert type(lon) == float
assert abs(lat - 51.50722) < 0.4
assert abs(lon - -0.12750) < 0.4


And, as usual, make sure that you have full docstrings and the code produces no warnings when running pylint.
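
The string-splitting step can be sketched on its own. This is a guess at the details rather than a full solution: parse_geo is a hypothetical helper, and it assumes that the coordinate string you extract looks like 'latitude; longitude' with a semicolon between the two numbers. Check the string your own code finds on the page and adjust the separator if it differs.

```python
def parse_geo(coord):
    """Return (lat, lon) as floats from a 'lat; lon' string, or None."""
    # Pages without geographic information give us nothing to parse.
    if not coord:
        return None
    lat_text, lon_text = coord.split(";")
    # float() ignores the whitespace left around each piece.
    lat, lon = float(lat_text), float(lon_text)
    return lat, lon


print(parse_geo("51.50722; -0.1275"))
print(parse_geo(None))
```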

### Even more even more practice

Just one more function, in case you have time (and this will take some thinking to get correct). Produce a function wikitext.link_to_section that returns a list with the text of the page split into sections. You can start by just returning a list of strings, but eventually you should return a list of dictionaries, each with the keys 'heading' and 'text', as in the examples below.

A good place to start is to consider why this problem is significantly harder than splitting the text into paragraphs (at least, I thought it was; maybe you'll be better at it than I was!).
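
One way to think about the difficulty: a paragraph's text lives inside its own tag, but a section's text is not nested inside its heading, so you have to walk the page in order and remember which heading you saw last. Here is a minimal sketch of that idea on made-up XML. The names sections_from_xml and SAMPLE are hypothetical, section headings are assumed to be h2 tags, and paragraphs are glued together with no separator to match the sample output below; a real page may need more care.

```python
import xml.etree.ElementTree as ET


def sections_from_xml(page_xml):
    """Return a list of {'heading': ..., 'text': ...} dictionaries."""
    root = ET.fromstring(page_xml)
    # Text before the first heading goes in a section with heading ''.
    sections = [{"heading": "", "text": ""}]
    # iter() visits every element in document order.
    for node in root.iter():
        if node.tag == "h2":
            heading = "".join(node.itertext()).strip()
            sections.append({"heading": heading, "text": ""})
        elif node.tag == "p":
            # Attach the paragraph to the most recent heading seen.
            sections[-1]["text"] += "".join(node.itertext()).strip()
    return sections


# Hypothetical stand-in for the XML of a real page.
SAMPLE = """<div>
<p>Lead paragraph one.</p>
<p>Lead paragraph two.</p>
<h2><span>Biography</span></h2>
<p>Born in Athens.</p>
<div><p>Nested paragraph.</p></div>
<h2>Works</h2>
<p>Wrote dialogues.</p>
</div>"""

for section in sections_from_xml(SAMPLE):
    print(repr(section["heading"]), "->", section["text"])
```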

In [21]:
doc = wikitext.link_to_section('Plato')

In [22]:
doc[0]['heading']

Out[22]:
''
In [23]:
doc[0]['text']

Out[23]:
'Plato (/ˈpleɪtoʊ/;[a] Greek: Πλάτων[a] Plátōn, pronounced\xa0[plá.tɔːn] in Classical Attic; 428/427 or 424/423[b] – 348/347 BC) was a philosopher in Classical Greece and the founder of the Academy in Athens, the first institution of higher learning in the Western world. He is widely considered the pivotal figure in the development of Western philosophy. Unlike nearly all of his philosophical contemporaries, Plato\'s entire work is believed to have survived intact for over 2,400 years.Along with his teacher, Socrates, and his most famous student, Aristotle, Plato laid the foundations of Western philosophy and science. Alfred North Whitehead once noted: "the safest general characterization of the European philosophical tradition is that it consists of a series of footnotes to Plato." In addition to being a foundational figure for Western science, philosophy, and mathematics, Plato has also often been cited as one of the founders of Western religion and spirituality.Plato was the innovator of the written dialogue and dialectic forms in philosophy. Plato appears to have been the founder of Western political philosophy, with his Republic, and Laws among other dialogues, providing some of the earliest extant treatments of political questions from a philosophical perspective. Plato\'s own most decisive philosophical influences are usually thought to have been Socrates, Parmenides, Heraclitus and Pythagoras, although few of his predecessors\' works remain extant and much of what we know about these figures today derives from Plato himself.The Stanford Encyclopedia of Philosophy describes Plato as "...one of the most dazzling writers in the Western literary tradition and one of the most penetrating, wide-ranging, and influential authors in the history of philosophy. ... He was not the first thinker or writer to whom the word “philosopher” should be applied. 
But he was so self-conscious about how philosophy should be conceived, and what its scope and ambitions properly are, and he so transformed the intellectual currents with which he grappled, that the subject of philosophy, as it is often conceived—a rigorous and systematic examination of ethical, political, metaphysical, and epistemological issues, armed with a distinctive method—can be called his invention. Few other authors in the history of Western philosophy approximate him in depth and range: perhaps only Aristotle (who studied with him), Aquinas and Kant would be generally agreed to be of the same rank."'
In [24]:
doc[2]['heading']

Out[24]:
'Intellectual influences on Plato'
In [25]:
doc[2]['text']

Out[25]:
'Although Socrates influenced Plato directly as related in the dialogues, the influence of Pythagoras upon Plato also appears to have significant discussion in the philosophical literature. Pythagoras, or in a broader sense, the Pythagoreans, allegedly exercised an important influence on the work of Plato. According to R. M. Hare, this influence consists of three points: (1) The platonic Republic might be related to the idea of "a tightly organized community of like-minded thinkers", like the one established by Pythagoras in Croton. (2) There is evidence that Plato possibly took from Pythagoras the idea that mathematics and, generally speaking, abstract thinking is a secure basis for philosophical thinking as well as "for substantial theses in science and morals". (3) Plato and Pythagoras shared a "mystical approach to the soul and its place in the material world". It is probable that both were influenced by Orphism.Pythagoras held that all things are number, and the cosmos comes from numerical principles. The physical world of becoming is an imitation of the mathematical world of being. These ideas were very influential on Heraclitus, Parmenides and Plato. Aristotle claimed that the philosophy of Plato closely followed the teachings of the Pythagoreans, and Cicero repeats this claim: "They say Plato learned all things Pythagorean" (Platonem ferunt didicisse Pythagorea omnia).George Karamanolis notes thatThese two philosophers, following the way initiated by pre-Socratic Greek philosophers like Pythagoras, depart from mythology and begin the metaphysical tradition that strongly influenced Plato and continues today.The surviving fragments written by Heraclitus suggest the view that all things are continuously changing, or becoming. His image of the river, with ever-changing waters, is well known.  
According to some ancient traditions like that of Diogenes Laërtius, Plato received these ideas through Heraclitus\' disciple Cratylus, who held the more radical view that continuous change warrants skepticism because we cannot define a thing that does not have a permanent nature.Parmenides adopted an altogether contrary vision, arguing for the idea of changeless Being and the view that change is an illusion. John Palmer notes "Parmenides’ distinction among the principal modes of being and his derivation of the attributes that must belong to what must be, simply as such, qualify him to be seen as the founder of metaphysics or ontology as a domain of inquiry distinct from theology."These ideas about change and permanence, or becoming and Being, influenced Plato in formulating his theory of forms. According to this theory, there is a world of perfect, eternal, and changeless forms, the realm of Being, and an imperfect sensible world of becoming that partakes of the qualities of the forms, and is its instantiation in the sensible world.The precise relationship between Plato and Socrates remains an area of contention among scholars. Plato makes it clear in his Apology of Socrates that he was a devoted young follower of Socrates. In that dialogue, Socrates is presented as mentioning Plato by name as one of those youths close enough to him to have been corrupted, if he were in fact guilty of corrupting the youth, and questioning why their fathers and brothers did not step forward to testify against him if he was indeed guilty of such a crime (33d–34a). Later, Plato is mentioned along with Crito, Critobolus, and Apollodorus as offering to pay a fine of 30 minas on Socrates\' behalf, in lieu of the death penalty proposed by Meletus (38b). In the Phaedo, the title character lists those who were in attendance at the prison on Socrates\' last day, explaining Plato\'s absence by saying, "Plato was ill". (Phaedo 59b)Plato never speaks in his own voice in his dialogues. 
In the Second Letter, it says, "no writing of Plato exists or ever will exist, but those now said to be his are those of a Socrates become beautiful and new" (341c); if the Letter is Plato\'s, the final qualification seems to call into question the dialogues\' historical fidelity. In any case, Xenophon and Aristophanes seem to present a somewhat different portrait of Socrates from the one Plato paints. Some have called attention to the problem of taking Plato\'s Socrates to be his mouthpiece, given Socrates\' reputation for irony and the dramatic nature of the dialogue form.Aristotle attributes a different doctrine with respect to Forms to Plato and Socrates (Metaphysics 987b1–11). Aristotle suggests that Socrates\' idea of forms can be discovered through investigation of the natural world, unlike Plato\'s Forms that exist beyond and outside the ordinary range of human understanding.'
In [26]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

Out[26]:
0
In [27]:
from pylint.epylint import lint
lint("wikitext.py")

 --------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Out[27]:
0