## Tutorial 29: Wikipedia image data

This tutorial introduces the wikiimage.py module, which we can use to download and process image data from Wikipedia pages. Start by loading the module, along with numpy and matplotlib (for plotting the images).

In [1]:
%pylab inline

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import wiki
import wikiimage
import wikitext

Populating the interactive namespace from numpy and matplotlib

In [2]:
plt.rcParams["figure.figsize"] = (12, 16)


### Reading image data from Wikipedia

The image_data_frame function takes a list of Wikipedia page names and returns a data frame listing all of the images on those pages. You can also supply the minimum and maximum allowed sizes of the images. By default, the function downloads a local copy of any image you do not already have.

In [3]:
df = wikiimage.image_data_frame(['Paris', 'London'], min_size=300)
df

Pulling image from MediaWiki: 'parisiicoins.jpg'
Pulling image from MediaWiki: 'bibliothcaquenationaledefrancecparissiterichelieu-salleovale.jpg'
Pulling image from MediaWiki: 'placedelarcapubliquechcunefoulesilencieuse.jpg'
Pulling image from MediaWiki: 'conseildetatpariswa.jpg'
Pulling image from MediaWiki: 'mignard-autoportrait.jpg'
Pulling image from MediaWiki: 'parisjuly-a.jpg'
Pulling image from MediaWiki: 'themuscaedorsayatsunsetcparisjuly.jpg'
Pulling image from MediaWiki: 'londonmontagel.jpg'
Pulling image from MediaWiki: 'englandsoutheastlocationmap.svg.png'
Pulling image from MediaWiki: 'londonthamessunsetpanorama-feb.jpg'
Pulling image from MediaWiki: 'neasdentemple-shreeswaminarayanhindumandir-gate.jpg'

Out[3]:

Note that the returned results include the page name, the path of the image, as well as a column called "max_size". The latter column gives the size of the largest dimension of the image (either the height or width).
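To make the "max_size" column concrete, here is a minimal sketch that does not use wikiimage at all. The file names and pixel dimensions are made up for illustration, the helper largest_dimension is hypothetical, and the idea that min_size=300 keeps images whose largest dimension is at least 300 pixels is an assumption about how the filter behaves.

```python
# Hypothetical (width, height) pairs in pixels; not real Wikipedia data.
example_sizes = {
    "landscape.jpg": (640, 480),
    "portrait.jpg": (300, 900),
    "thumbnail.png": (120, 80),
}


def largest_dimension(size):
    """Return the larger of an image's width and height."""
    width, height = size
    return max(width, height)


# One "max_size" value per image, as described in the note above.
max_sizes = {name: largest_dimension(size) for name, size in example_sizes.items()}

# Assumed behaviour of a min_size=300 filter: keep images whose
# largest dimension is at least 300 pixels.
kept = [name for name, size in max_sizes.items() if size >= 300]

print(max_sizes)
print(kept)
```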

### Displaying the images in Python

The load_image function takes the name of an image and returns a PIL object, a special image type that can be plotted in Python.

In [4]:
img = wikiimage.load_image(df.img.values[4])
type(img)

Using TensorFlow backend.

Out[4]:
PIL.JpegImagePlugin.JpegImageFile
In [5]:
plt.imshow(img)

Out[5]:
<matplotlib.image.AxesImage at 0xb3aefef98>
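If you want to work with the pixel values directly, a PIL image can be converted to a numpy array. The sketch below uses a small synthetic image created with Image.new as a stand-in for one loaded from Wikipedia, so the size and color are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for a downloaded image: 4 pixels wide, 2 pixels tall, solid red.
img_demo = Image.new("RGB", (4, 2), color=(255, 0, 0))

# PIL reports size as (width, height).
width, height = img_demo.size

# numpy arrays are indexed as (row, column, channel), i.e. (height, width, 3).
pixels = np.asarray(img_demo)

print(img_demo.size)   # (width, height)
print(pixels.shape)    # (height, width, channels)
print(max(img_demo.size))  # largest dimension of the image
```

Note that the order of the dimensions is swapped between the two representations: PIL gives the width first, while the array shape gives the height first.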

Here is some Python code that displays all of the images in the data frame. Note that you may need to modify the line plt.subplot(4, 3, ind + 1) if you change the data. The 4 gives the number of rows in the plot and the 3 gives the number of columns. If you have more than 12 images, only the first 12 will be shown. You can also adjust the plt.rcParams["figure.figsize"] = (12, 16) setting above to change the overall size of the output (I find that I need to adjust this depending on my screen and the images in question).
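Rather than editing the grid by hand each time the data changes, the number of rows can be computed from the number of images. This is a minimal sketch of that idea, not part of wikiimage: random arrays stand in for the downloaded images, and the names fake_images, n_cols, and n_rows are hypothetical.

```python
import math

import matplotlib.pyplot as plt
import numpy as np

# Random arrays stand in for the downloaded images (an assumption for illustration).
rng = np.random.default_rng(0)
fake_images = [rng.random((20, 30, 3)) for _ in range(14)]

# Fix the number of columns and compute how many rows are needed to fit every image.
n_cols = 3
n_rows = math.ceil(len(fake_images) / n_cols)

fig = plt.figure(figsize=(12, 4 * n_rows))
for ind, image in enumerate(fake_images):
    # plt.subplot takes (rows, columns, position); positions start at 1.
    ax = plt.subplot(n_rows, n_cols, ind + 1)
    ax.imshow(image)
    ax.axis("off")
```

With 14 images and 3 columns this produces 5 rows, with the last row only partly filled.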

In [6]:
for ind, idx in enumerate(range(df.shape[0])):
try: