Tutorial 16: Parsing XML

Extensible Markup Language, more commonly known as XML, is a standard format for structuring documents and information. One particular extension is XHTML, a standard used to describe the content of webpages.

We have already worked a bit with parsing the (X)HTML code returned from the MediaWiki API using regular expressions. Regular expressions are a great way to start, but for more extensive use a proper library that fully parses the XML offers much more control and avoids common pitfalls.

In these notes we see how to use the xml module to parse the text returned from Wikipedia.

Creating an ElementTree object

It will be easier to understand how to parse XML code using a smaller example than we would get from Wikipedia. Here is a very simple snippet of code that contains a title and two paragraphs.

In [1]:
html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""

print(html)

We start by reading in the submodule xml.etree.ElementTree. By convention, we'll save it as ET.

In [2]:
import xml.etree.ElementTree as ET

Next, use the fromstring function to take the string and convert it into an Element object, the root of the parse tree.

In [3]:
tree = ET.fromstring(html)
type(tree)

The object has three elements corresponding to the three top-level elements in the XML. Elements are accessed the same way they would be in a list: with square brackets and an index. The first element is our h1 element:

In [4]:
print(tree[0])
<Element 'h1' at 0x111c48b88>

The next two elements are paragraph tags:

In [5]:
print(tree[1])
print(tree[2])
<Element 'p' at 0x111c5c278>
<Element 'p' at 0x111c5c318>

Also like a list, we can cycle through the elements with a for loop:

In [6]:
for child in tree:
    print(child)
<Element 'h1' at 0x111c48b88>
<Element 'p' at 0x111c5c278>
<Element 'p' at 0x111c5c318>

Finally, we can also manually convert the tree to a list:

In [7]:
list(tree)
Out[7]:
[<Element 'h1' at 0x111c48b88>,
 <Element 'p' at 0x111c5c278>,
 <Element 'p' at 0x111c5c318>]

Typically there is not much reason to manually convert an ElementTree into a list in your final code, but it can be very useful when testing and debugging.

Working with XML Elements

Let's take the first element of tree, the title of our document.

In [8]:
child = tree[0]

There are several useful properties given to us by the element. The tag property of the element is a string giving the type of the element.

In [9]:
child.tag
Out[9]:
'h1'

The attrib property is a dictionary that yields the properties (if there are any) of the XML tag. Looking at the 'h1' element in the example, we see that there is an attribute named 'class' that's equal to 'page'.

In [10]:
child.attrib
Out[10]:
{'class': 'page'}
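As an aside, the same attributes can also be looked up with the element's get method, which returns None (or a supplied default) instead of raising a KeyError when an attribute is missing. A minimal sketch, using just the 'h1' tag on its own:

```python
import xml.etree.ElementTree as ET

child = ET.fromstring("<h1 class='page'><i>A title in italics</i></h1>")

print(child.get('class'))          # page
print(child.get('id'))             # None; the 'h1' tag has no 'id' attribute
print(child.get('id', 'missing'))  # missing
```

This is handy when looping over elements that may or may not carry a given attribute.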

Finally, the property text contains the actual text inside of the element.

In [11]:
child.text

Notice that there is no text in the tag. What's going on here? If you look at the XML input, there is an 'i' tag inside of the 'h1' tag and all of the text is inside of this inner tag. We can see all of the elements inside of 'h1', as above, using the list function:

In [12]:
list(child)
Out[12]:
[<Element 'i' at 0x111c5c1d8>]

This child of the child has a tag equal to 'i' (it's the italics tag in HTML):

In [13]:
child[0].tag
Out[13]:
'i'

But no attributes:

In [14]:
child[0].attrib
Out[14]:
{}

However, it does have a text property containing the actual text:

In [15]:
child[0].text
Out[15]:
'A title in italics'

Let's now work with the first paragraph element:

In [16]:
child = tree[1]
child.tag
Out[16]:
'p'

As you should expect, it has a 'b' (bold) element inside of it:

In [17]:
list(child)
Out[17]:
[<Element 'b' at 0x111c5c2c8>]

What happens if we try to grab the text?

In [18]:
child.text
Out[18]:
'Here is one paragraph of text with something in a '

It only contains the text up to the 'b' tag, similar to what happened with the title element. This could be very difficult to work with if we wanted all of the information inside of a paragraph or other element. The solution is to use the method itertext; it returns (when converted into a list) all of the text inside of an element.

In [19]:
list(child.itertext())
Out[19]:
['Here is one paragraph of text with something in a ', 'bold', ' font.']

The individual pieces of text can be combined by using the string method join:

In [20]:
"".join(child.itertext())
Out[20]:
'Here is one paragraph of text with something in a bold font.'

Loops and XPath Expressions

We now have the basic elements for working with an XML document. If we wanted, for example, to get a list with one element for each paragraph we could use a for loop and if statement:

In [21]:
p = []
for child in tree:
    if child.tag == "p":
        text = "".join(child.itertext())
        p.append(text)
        
p
Out[21]:
['Here is one paragraph of text with something in a bold font.',
 ' Another paragraph! In this case I have a link that you click on.']

For some applications, this approach (cycling through children) is ideal. One drawback, however, is that it becomes difficult to find elements that might be buried deeper in the XML tree. For example, suppose we wanted all of the links in the document.

A way to address this is to use a notation called an XPath expression that describes an element in an XML document. We won't go into the full spec for XPath expressions, but will show a few examples that will be most useful.

To use an XPath expression to find nodes in an ElementTree, we use the findall method. A simple query starts with './/' (this means that the tag can appear anywhere in the tree) and includes the name of the tags that you want to find:

In [22]:
list(tree.findall(".//i"))
Out[22]:
[<Element 'i' at 0x111c5c1d8>, <Element 'i' at 0x111c5c3b8>]

If you want to find one element inside of another, use a /. For example, this finds italics tags inside of a paragraph:

In [23]:
list(tree.findall(".//p/i"))
Out[23]:
[<Element 'i' at 0x111c5c3b8>]

Finally, we can specify attributes using square brackets:

In [24]:
list(tree.findall(".//i[@id='my']"))
Out[24]:
[<Element 'i' at 0x111c5c3b8>]

These will go a long way towards letting us parse information in the Wikipedia XML output.
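Putting findall together with the attrib property answers the question raised earlier: grabbing all of the links in a document. Here is a short, self-contained sketch using the same snippet from the start of these notes:

```python
import xml.etree.ElementTree as ET

# The same small snippet from the start of these notes
html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""
tree = ET.fromstring(html)

# Find every 'a' element anywhere in the tree and read its 'href' attribute
links = [a.attrib['href'] for a in tree.findall(".//a")]
print(links)  # ['https://github.com']
```

The same pattern works unchanged on the much larger Wikipedia trees below.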

Wikipedia Application

Let's try to apply what we have now seen to some actual data from Wikipedia. Load the wiki module:

In [25]:
import wiki

assert wiki.__version__ >= 3

And pull up the page on Paris (it will be useful to also open the Paris page itself in a browser.)

In [26]:
data = wiki.get_wiki_json("Paris")
html = data['text']['*']
html[:1000]
Out[26]:
'<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">This article is about the capital of France. For other uses, see <a href="/wiki/Paris_(disambiguation)" class="mw-disambig" title="Paris (disambiguation)">Paris (disambiguation)</a>.</div>\n<p class="mw-empty-elt">\n\n\n\n</p>\n<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Commune and department in Île-de-France, France</div>\n<table class="infobox geography vcard" style="width:22em;width:23em"><tbody><tr><th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap"><span class="fn org">Paris</span></th></tr><tr><td colspan="2" style="text-align:center;background-color:#cddeff; font-weight:bold;">\n<span class="category"><a href="/wiki/Communes_of_France" title="Communes of France">Commune</a> and <a href="/wiki/Departments_of_France" title="Departments of France">department</a></span></td></tr><tr class="mergedtopro'

Now, create an xml.etree.ElementTree.Element object named tree from the html data.

In [27]:
tree = ET.fromstring(html)

Using a for loop, create a list named p with one element for each paragraph in tree containing all of the text in the paragraph.

In [28]:
p = []
for child in tree:
    if child.tag == 'p':
        p.append("".join(child.itertext()))

The element p[0] should contain just four new lines. Check to make sure that p[1] matches the first real paragraph on the Wikipedia page.

In [29]:
p[1]
Out[29]:
"Paris (French pronunciation:\xa0\u200b[paʁi]\xa0(\xa0listen)) is the capital and most populous city of France, with an area of 105 square kilometres (41 square miles) and a population of 2,206,488.[5][6] With 200,000 inhabitants in 1328, Paris, then already the capital of France, was the most populous city of Europe. By comparison, London in 1300 had 80,000 inhabitants.[7] Since the 17th century, Paris has been one of Europe's major centres of finance, commerce, fashion, science, music, and painting. The Paris Region had a GDP of €681 billion (US$850 billion) in 2016, accounting for 31 per cent of the GDP of France.[8] In 2013–2014, the Paris Region had the third-highest GDP in the world and the largest regional GDP in the EU. According to the Economist Intelligence Unit Worldwide Cost of Living Survey in 2018, Paris was the second-most expensive city in the world, behind Singapore and ahead of Zurich, Hong Kong, Oslo and Geneva.[9]\n"

Using an XPath expression, find all of the 'h2' elements (you do not need to save them). These correspond to the section headings in the article.

In [30]:
tree.findall('.//h2')
Out[30]:
[<Element 'h2' at 0x112738638>,
 <Element 'h2' at 0x11275a818>,
 <Element 'h2' at 0x11275ec78>,
 <Element 'h2' at 0x1127764a8>,
 <Element 'h2' at 0x1127a2408>,
 <Element 'h2' at 0x1127bfd68>,
 <Element 'h2' at 0x1127dfe58>,
 <Element 'h2' at 0x11281c4a8>,
 <Element 'h2' at 0x11283e188>,
 <Element 'h2' at 0x112852ae8>,
 <Element 'h2' at 0x1104827c8>,
 <Element 'h2' at 0x1104964f8>,
 <Element 'h2' at 0x1104a2b38>,
 <Element 'h2' at 0x10fb39318>,
 <Element 'h2' at 0x10fb3f188>,
 <Element 'h2' at 0x10fb47e58>,
 <Element 'h2' at 0x10fb70818>,
 <Element 'h2' at 0x10fb763b8>,
 <Element 'h2' at 0x112387e08>,
 <Element 'h2' at 0x112390598>]

Now, there is a 'span' element of class "mw-headline" inside of the headers that contains the actual text of the section. Write an XPath expression that grabs these elements and stores them in a variable named headings:

In [31]:
headings = tree.findall('.//h2/span[@class="mw-headline"]')

Now, cycle through the headings, extract the text property and append these to a list named headings_text:

In [32]:
headings_text = []
for x in headings:
    headings_text.append(x.text)

Print out the object headings_text:

In [33]:
headings_text
Out[33]:
['Etymology',
 'History',
 'Geography',
 'Administration',
 'Cityscape',
 'Demographics',
 'Economy',
 'Tourism',
 'Culture',
 'Education',
 'Sports',
 'Infrastructure',
 'Healthcare',
 'Media',
 'International relations',
 'See also',
 'References',
 'Further reading',
 'External links']

Verify that these match the section headings on the page.

Finally, there is a special Wikipedia XML span element of class 'geo'. The page may contain many of these, but we only need the first, so use tree.find in place of tree.findall. In the code below, find this first element and extract the text:

In [34]:
tree.find(".//span[@class='geo']").text
Out[34]:
'48.8567; 2.3508'

You should see the string '48.8567; 2.3508'. This is the latitude and longitude of Paris. We would be able to automate detection of this information to add context to any pages with an associated latitude and longitude.
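One way to sketch that automation, assuming the 'geo' span always stores its coordinates as 'latitude; longitude' (as it does here); the helper name parse_geo is our own, not part of the wiki module or the xml library:

```python
def parse_geo(text):
    """Split a 'lat; lon' string, e.g. '48.8567; 2.3508', into two floats."""
    lat, lon = text.split(";")
    return float(lat), float(lon)

lat, lon = parse_geo('48.8567; 2.3508')
print(lat, lon)  # 48.8567 2.3508
```

In practice you would first check that tree.find(".//span[@class='geo']") returned an element (find returns None when no match exists) before parsing its text.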