Tutorial 29: Wikipedia image data

This tutorial introduces the wikiimage.py module, which we can use to grab and process image data from Wikipedia pages. Start by importing the module, along with numpy and matplotlib (for plotting the images).

In [1]:
%pylab inline

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import wiki
import wikiimage
import wikitext
Populating the interactive namespace from numpy and matplotlib
In [2]:
plt.rcParams["figure.figsize"] = (12, 16)

Reading image data from Wikipedia

The image_data_frame function takes a list of Wikipedia pages and returns a data frame with one row for each image found on those pages. You can also supply the minimum and maximum allowed sizes of images. By default, the function downloads a local copy of any image you do not already have.

In [3]:
df = wikiimage.image_data_frame(['Paris', 'London'], min_size=300)
Pulling image from MediaWiki: 'parisiicoins.jpg'
Pulling image from MediaWiki: 'bibliothcaquenationaledefrancecparissiterichelieu-salleovale.jpg'
Pulling image from MediaWiki: 'placedelarcapubliquechcunefoulesilencieuse.jpg'
Pulling image from MediaWiki: 'conseildetatpariswa.jpg'
Pulling image from MediaWiki: 'mignard-autoportrait.jpg'
Pulling image from MediaWiki: 'parisjuly-a.jpg'
Pulling image from MediaWiki: 'themuscaedorsayatsunsetcparisjuly.jpg'
Pulling image from MediaWiki: 'londonmontagel.jpg'
Pulling image from MediaWiki: 'englandsoutheastlocationmap.svg.png'
Pulling image from MediaWiki: 'unitedkingdomadmlocationmap.svg.png'
Pulling image from MediaWiki: 'londonthamessunsetpanorama-feb.jpg'
Pulling image from MediaWiki: 'neasdentemple-shreeswaminarayanhindumandir-gate.jpg'
    page    img                                                max_size  img_links
 0  Paris   parisiicoins.jpg                                        330  https://upload.wikimedia.org/wikipedia/commons...
 1  Paris   bibliothcaquenationaledefrancecparissiterichel...       300  https://upload.wikimedia.org/wikipedia/commons...
 2  Paris   placedelarcapubliquechcunefoulesilencieuse.jpg          350  https://upload.wikimedia.org/wikipedia/commons...
 3  Paris   conseildetatpariswa.jpg                                 330  https://upload.wikimedia.org/wikipedia/commons...
 4  Paris   mignard-autoportrait.jpg                                304  https://upload.wikimedia.org/wikipedia/commons...
 5  Paris   parisjuly-a.jpg                                         330  https://upload.wikimedia.org/wikipedia/commons...
 6  Paris   themuscaedorsayatsunsetcparisjuly.jpg                   330  https://upload.wikimedia.org/wikipedia/commons...
 7  London  londonmontagel.jpg                                      415  https://upload.wikimedia.org/wikipedia/commons...
 8  London  englandsoutheastlocationmap.svg.png                     371  https://upload.wikimedia.org/wikipedia/commons...
 9  London  unitedkingdomadmlocationmap.svg.png                     386  https://upload.wikimedia.org/wikipedia/commons...
10  London  londonthamessunsetpanorama-feb.jpg                      300  https://upload.wikimedia.org/wikipedia/commons...
11  London  neasdentemple-shreeswaminarayanhindumandir-gat...       340  https://upload.wikimedia.org/wikipedia/commons...
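
The download-only-when-missing behaviour described above can be illustrated with a small sketch. This is not the code inside wikiimage.py; the names fetch_if_missing, fake_downloader, and the example URL are hypothetical and exist only for illustration. The stand-in downloader avoids any network access so the caching logic can be seen on its own.

```python
import os
import tempfile


def fetch_if_missing(file_name, url, cache_dir, downloader):
    """Return the local path for file_name, downloading only on a cache miss.

    Hypothetical helper: the real module may organise its cache differently.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, file_name)
    if not os.path.exists(local_path):
        # Cache miss: fetch the bytes once and store them for next time.
        print("Pulling image: '{}'".format(file_name))
        with open(local_path, "wb") as handle:
            handle.write(downloader(url))
    return local_path


# Stand-in downloader that records its calls instead of using the network.
calls = []


def fake_downloader(url):
    calls.append(url)
    return b"fake image bytes"


cache_dir = tempfile.mkdtemp()
example_url = "https://example.org/example.jpg"  # placeholder URL

# The first call writes the file; the second finds it and skips the download.
first = fetch_if_missing("example.jpg", example_url, cache_dir, fake_downloader)
second = fetch_if_missing("example.jpg", example_url, cache_dir, fake_downloader)
```

With a real downloader, for example one that wraps urllib.request.urlopen(url).read(), the same pattern would print a "Pulling image" message only the first time each file is requested.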

Note that the returned results include the page name, the file name of the image, a link to the image, and a column called "max_size". This last column gives the size of the image's largest dimension (either the height or the width).
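
To make the "max_size" idea concrete, here is a minimal sketch of how a size filter based on the largest dimension might work. The function names and the example image sizes are made up for illustration, and treating min_size as an inclusive bound is an assumption (it is consistent with the rows above whose max_size is exactly 300).

```python
def max_dimension(width, height):
    """Size of the largest dimension of an image, in pixels."""
    return max(width, height)


def within_size_limits(width, height, min_size=0, max_size=None):
    """Check whether the largest dimension falls inside the allowed range.

    Assumption: both bounds are inclusive; max_size=None means no upper limit.
    """
    largest = max_dimension(width, height)
    if largest < min_size:
        return False
    if max_size is not None and largest > max_size:
        return False
    return True


# Made-up (width, height) pairs, not taken from the Wikipedia pages above.
examples = {
    "portrait.jpg": (220, 304),
    "thumbnail.jpg": (120, 90),
    "panorama.jpg": (300, 85),
}

# Keep only the images whose largest side is at least 300 pixels.
kept = [
    name
    for name, (width, height) in examples.items()
    if within_size_limits(width, height, min_size=300)
]
```

Note that a wide, short panorama passes the filter because only its largest side is compared against the limit.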

Displaying the images in Python

The load_image function takes the name of an image and returns a PIL object, a special image type that can be plotted in Python.

In [4]:
img = wikiimage.load_image(df.img.values[4])
Using TensorFlow backend.
In [5]:
plt.imshow(img)
<matplotlib.image.AxesImage at 0xb3aefef98>

Here is some Python code that displays all of the images in the data frame. Note that you may need to modify the line plt.subplot(4, 3, ind + 1) if you change the data. The 4 gives the number of rows in the plot and the 3 gives the number of columns. If you have more than 12 images, only the first 12 will be shown. You can also adjust the plt.rcParams["figure.figsize"] = (12, 16) setting above to change the overall size of the output (I find that I need to adjust this depending on my screen and the images in question).
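
If the number of images changes, the grid arguments can be worked out rather than edited by hand. The sketch below uses a hypothetical helper, grid_shape, which is not part of wikiimage.py. It relies on the fact that plt.subplot takes the number of rows first and the number of columns second; the cap of 12 images is an assumption that mirrors the 4-by-3 grid used in this tutorial.

```python
import math


def grid_shape(n_images, n_cols=3, max_images=12):
    """Return (n_rows, n_cols, n_shown) for a subplot grid.

    Hypothetical helper: caps the number of images shown and computes
    how many rows are needed to fit them into n_cols columns.
    """
    n_shown = min(n_images, max_images)
    # Always keep at least one row so the grid is valid even with no images.
    n_rows = max(1, math.ceil(n_shown / n_cols))
    return n_rows, n_cols, n_shown


# Twelve images fill the 4-row, 3-column grid used in this tutorial.
n_rows, n_cols, n_shown = grid_shape(12)
```

The result could then be used as plt.subplot(n_rows, n_cols, ind + 1) for ind in range(n_shown), which keeps the subplot index within the grid.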

In [6]:
for ind, idx in enumerate(range(df.shape[0])):
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(4, 3, ind + 1)

        img = wikiimage.load_image(df.iloc[idx]['img'])