Extensible Markup Language, more commonly known as XML, is a standard format for structuring documents and information. One particular extension is XHTML, a standard used to describe the content of webpages.
We have already worked a bit with parsing the (X)HTML code returned from the MediaWiki API using regular expressions. Regular expressions are a great way to start, but for more extensive use a proper library that fully parses XML offers much more control and avoids common pitfalls.
In these notes we will see how to use the xml module to parse the text returned from Wikipedia.
It will be easier to understand how to parse XML code using a smaller example than we would get from Wikipedia. Here is a very simple snippet of code that contains a title and two paragraphs.
html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""
print(html)
We start by reading in the submodule xml.etree.ElementTree. By convention, we'll import it as ET:
import xml.etree.ElementTree as ET
Next, use the fromstring function to take the string and convert it into an Element object:
tree = ET.fromstring(html)
type(tree)
The object has three child elements, corresponding to the three tags inside the top-level div in the XML. Elements are accessed the same way they would be in a list: with square brackets and an index. The first element is our 'h1' header:
tree[0]
<Element 'h1' at 0x111c48b88>
The next two elements are the paragraph tags:
<Element 'p' at 0x111c5c278> <Element 'p' at 0x111c5c318>
Also like a list, we can cycle through the elements with a for loop:
for child in tree:
    print(child)
<Element 'h1' at 0x111c48b88>
<Element 'p' at 0x111c5c278>
<Element 'p' at 0x111c5c318>
Finally, we can also manually convert the tree to a list:
list(tree)
[<Element 'h1' at 0x111c48b88>, <Element 'p' at 0x111c5c278>, <Element 'p' at 0x111c5c318>]
Typically there is not much reason to manually convert an ElementTree element into a list in your final code, but it can be very useful when testing and debugging. Let's grab the first child, the 'h1' element:
child = tree[0]
There are several useful properties given to us by the element. The tag property of the element is a string giving the type of the tag:
child.tag
The attrib property is a dictionary that yields the attributes (if there are any) of the XML tag. Looking at the 'h1' element in the example, we see that there is an attribute named 'class' that's equal to 'page':
child.attrib
Finally, the property text contains the actual text inside of the element:
child.text
You should notice that there is no text in the tag. What's going on here?
If you look at the XML input, there is an 'i' tag inside of the 'h1' tag and all of the text is inside of this tag. We can see all of the elements inside of 'h1', as above, using the list function:
list(child)
[<Element 'i' at 0x111c5c1d8>]
This child of the child has a tag equal to 'i' (it's the italics tag in HTML):
child[0].tag
But no attributes:
child[0].attrib
However, it does have a text property containing the actual text:
child[0].text
'A title in italics'
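The tag, attrib, and text pattern above can be condensed into a small self-contained sketch; the fragment below is made up for illustration and simply mirrors the header from our example:

```python
import xml.etree.ElementTree as ET

# a header whose text sits entirely inside a nested <i> tag
el = ET.fromstring("<h1 class='page'><i>A title in italics</i></h1>")

print(el.tag)      # h1
print(el.attrib)   # {'class': 'page'}
print(el.text)     # None: every character lives inside the child tag
print(el[0].text)  # A title in italics
```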
Let's now work with the first paragraph element:
child = tree[1]
child.tag
As you should expect, it has a 'b' (bold) element inside of it:
list(child)
[<Element 'b' at 0x111c5c2c8>]
What happens if we try to grab the text?
child.text
'Here is one paragraph of text with something in a '
It only contains the text up to the 'b' tag, similar to what happened with the title element. This could be very difficult to work with if we wanted all of the information inside of a paragraph or other element.
The solution is to use the method itertext; it (when converted into a list) returns all of the text inside of an element:
list(child.itertext())
['Here is one paragraph of text with something in a ', 'bold', ' font.']
The individual elements can be combined by using the string function join:
"".join(child.itertext())
'Here is one paragraph of text with something in a bold font.'
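As a quick standalone check of the itertext-and-join idiom (the fragment here is invented for illustration):

```python
import xml.etree.ElementTree as ET

el = ET.fromstring("<p>text with a <b>bold</b> word.</p>")

pieces = list(el.itertext())
print(pieces)           # ['text with a ', 'bold', ' word.']
print("".join(pieces))  # text with a bold word.
```

Notice that itertext also picks up the text that comes after the nested tag, which is exactly what plain .text misses.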
We now have the basic elements for working with an XML document. If we wanted, for example, to get a list with one element for each paragraph we could use a for loop and the append method:
p = []
for child in tree:
    if child.tag == "p":
        text = "".join(child.itertext())
        p.append(text)
p
['Here is one paragraph of text with something in a bold font.', ' Another paragraph! In this case I have a link that you click on.']
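The same loop can also be written as a list comprehension; this is only a stylistic alternative, sketched here on a made-up fragment:

```python
import xml.etree.ElementTree as ET

html = "<div><h1>title</h1><p>one <b>bold</b> word.</p><p>two</p></div>"
tree = ET.fromstring(html)

# keep only the direct children whose tag is 'p'
p = ["".join(child.itertext()) for child in tree if child.tag == "p"]
print(p)  # ['one bold word.', 'two']
```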
For some applications, this approach (cycling through children) is ideal. One drawback, however, is that it becomes difficult to find elements that might be buried deeper in the XML tree, for example if we wanted all of the links in the document.
A way to address this is to use a notation called an XPath expression that describes an element in an XML document. We won't go into the full specification for XPath expressions, but will show a few examples that will be most useful.
To use an XPath expression to find nodes in an ElementTree, we use the findall method. A simple query just starts with './/' (this means that the tag can appear anywhere in the tree) and includes the name of the tag that you want to find:
tree.findall('.//i')
[<Element 'i' at 0x111c5c1d8>, <Element 'i' at 0x111c5c3b8>]
If you want to find one element inside of another, use a /. For example, this finds italics tags inside of a paragraph:
tree.findall('.//p/i')
[<Element 'i' at 0x111c5c3b8>]
Finally, we can specify attributes using square brackets:
tree.findall(".//i[@id='my']")
[<Element 'i' at 0x111c5c3b8>]
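Putting the three query styles together in one self-contained sketch (again on a fragment made up for illustration):

```python
import xml.etree.ElementTree as ET

html = "<div><h1><i>title</i></h1><p>a <i id='my'>tagged</i> word</p></div>"
tree = ET.fromstring(html)

print(len(tree.findall(".//i")))               # 2: 'i' tags anywhere in the tree
print(len(tree.findall(".//p/i")))             # 1: 'i' tags directly inside a 'p'
print(tree.findall(".//i[@id='my']")[0].text)  # tagged
```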
These will go a long way towards letting us parse information in the Wikipedia XML output.
Let's try to apply what we have now seen to some actual data from Wikipedia. Load the wiki module:
import wiki
assert wiki.__version__ >= 3
data = wiki.get_wiki_json("Paris")
html = data['text']['*']
html[:1000]
'<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">This article is about the capital of France. For other uses, see <a href="/wiki/Paris_(disambiguation)" class="mw-disambig" title="Paris (disambiguation)">Paris (disambiguation)</a>.</div>\n<p class="mw-empty-elt">\n\n\n\n</p>\n<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Commune and department in Île-de-France, France</div>\n<table class="infobox geography vcard" style="width:22em;width:23em"><tbody><tr><th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap"><span class="fn org">Paris</span></th></tr><tr><td colspan="2" style="text-align:center;background-color:#cddeff; font-weight:bold;">\n<span class="category"><a href="/wiki/Communes_of_France" title="Communes of France">Commune</a> and <a href="/wiki/Departments_of_France" title="Departments of France">department</a></span></td></tr><tr class="mergedtopro'
Now, create an xml.etree.ElementTree.Element object named tree from the html string:
tree = ET.fromstring(html)
Using a for loop, create a list named p with one element for each paragraph in the tree, containing all of the text in the paragraph.
p = []
for child in tree:
    if child.tag == 'p':
        p.append("".join(child.itertext()))
The first element of p should contain just four new lines. Check to make sure the next element of p matches the first real paragraph on the Wikipedia page:
p[1]
"Paris (French pronunciation:\xa0\u200b[paʁi]\xa0(\xa0listen)) is the capital and most populous city of France, with an area of 105 square kilometres (41 square miles) and a population of 2,206,488. With 200,000 inhabitants in 1328, Paris, then already the capital of France, was the most populous city of Europe. By comparison, London in 1300 had 80,000 inhabitants. Since the 17th century, Paris has been one of Europe's major centres of finance, commerce, fashion, science, music, and painting. The Paris Region had a GDP of €681 billion (US$850 billion) in 2016, accounting for 31 per cent of the GDP of France. In 2013–2014, the Paris Region had the third-highest GDP in the world and the largest regional GDP in the EU. According to the Economist Intelligence Unit Worldwide Cost of Living Survey in 2018, Paris was the second-most expensive city in the world, behind Singapore and ahead of Zurich, Hong Kong, Oslo and Geneva.\n"
Using an XPath expression, find all of the 'h2' elements (you do not need to save them). These correspond to the section headings in the article:
tree.findall('.//h2')
[<Element 'h2' at 0x112738638>, <Element 'h2' at 0x11275a818>, <Element 'h2' at 0x11275ec78>, <Element 'h2' at 0x1127764a8>, <Element 'h2' at 0x1127a2408>, <Element 'h2' at 0x1127bfd68>, <Element 'h2' at 0x1127dfe58>, <Element 'h2' at 0x11281c4a8>, <Element 'h2' at 0x11283e188>, <Element 'h2' at 0x112852ae8>, <Element 'h2' at 0x1104827c8>, <Element 'h2' at 0x1104964f8>, <Element 'h2' at 0x1104a2b38>, <Element 'h2' at 0x10fb39318>, <Element 'h2' at 0x10fb3f188>, <Element 'h2' at 0x10fb47e58>, <Element 'h2' at 0x10fb70818>, <Element 'h2' at 0x10fb763b8>, <Element 'h2' at 0x112387e08>, <Element 'h2' at 0x112390598>]
Now, there is a 'span' element of class "mw-headline" inside of the headers that contains the actual text of the section. Write an XPath expression that grabs these elements and stores them in a variable named headings:
headings = tree.findall('.//h2/span[@class="mw-headline"]')
Now, cycle through the headings, extract the text property, and append these to a list named headings_text:
headings_text = []
for x in headings:
    headings_text.append(x.text)
Print out the object headings_text:
headings_text
['Etymology', 'History', 'Geography', 'Administration', 'Cityscape', 'Demographics', 'Economy', 'Tourism', 'Culture', 'Education', 'Sports', 'Infrastructure', 'Healthcare', 'Media', 'International relations', 'See also', 'References', 'Further reading', 'External links']
Verify that these headings match those on the page.
Finally, there is a special Wikipedia XML span element of class 'geo'. The page may contain many of these, but we only need the first, so we can use tree.find in place of tree.findall. In the code below, find this first element and extract the text:
tree.find(".//span[@class='geo']").text
You should see the string '48.8567; 2.3508'. This is the latitude and longitude of Paris. We would be able to automate detection of this information to add context to any pages with an associated latitude and longitude.
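Turning that string into numeric coordinates is then a one-liner; a minimal sketch, assuming the 'latitude; longitude' format shown above:

```python
geo = "48.8567; 2.3508"  # text extracted from the 'geo' span
lat, lon = (float(part) for part in geo.split(";"))
print(lat, lon)  # 48.8567 2.3508
```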