Tutorial 11: Example Analysis of American Authors

In this tutorial, we combine all of our skills from the semester so far (as well as some new libraries) to carry out an actual analysis of data from Wikipedia.

Question of interest

In this tutorial we will focus on collecting Wikipedia data from a list of American novelists. Our initial question of interest is how several page-level metrics relate to one another, since each of them, in its own way, indicates how prominently a given author is represented. We will see how to store this data as a CSV file as well as how to produce interactive graphics to explore the dataset.

wiki.py

Last time we found a bug in my wiki.py code. Please re-download the script and replace your original copy with the new one. The following loads the module and checks that you have version 2 or higher (if there is no error, it works!):

In [1]:
import wiki

assert wiki.__version__ >= 2

Getting the data

There is a list on Wikipedia of pages for American novelists. Please start by looking at the page in your browser:

Notice that many of the links on this page are to specific novels by an author. There are also links at the top and bottom of the page before the actual list starts. We don't want these in our analysis! In order to grab just the links to actual authors, we need to use regular expressions and parse data directly from the page.

Start by loading the re module:

In [2]:
import re

Next, grab the page for the list of American novelists. We'll need the actual text of the page; the first 1000 characters are printed here for reference:

In [3]:
data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
print(data_html[:1000])
Pulling data from MediaWiki API: 'List_of_American_novelists'
<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">Main category: <a href="/wiki/Category:American_novelists" title="Category:American novelists">American novelists</a></div>
<p>This is a <b>list of <a href="/wiki/Novelist" title="Novelist">novelists</a> from the <a href="/wiki/United_States" title="United States">United States</a></b>, listed with titles of a major work for each.
</p><p>This is not intended to be a list of every American (born U.S. citizen, naturalized citizen, or long-time resident alien) who has published a <a href="/wiki/Novel" title="Novel">novel</a>. (For the purposes of this article, <i>novel</i> is defined as an extended work of <a href="/wiki/Fiction" title="Fiction">fiction</a>. This definition is loosely interpreted to include novellas, novelettes, and books of interconnected short stories.) <a href="/wiki/Novelist" title="Novelist">Novelists</a> on this list have achieved a notability that exceeds merely having been pub

In order to get just the authors, a trick we can use on this page is to find only links that come directly after the HTML tag <li> (a list item). This avoids most of the links we don't want, but accidentally grabs a few at the bottom of the page. We will deal with those in a moment. Here is the regular expression that grabs the pages of interest:

In [4]:
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
authors[:10]
Out[4]:
['Patricia_Aakhus',
 'Atia_Abawi',
 'Edward_Abbey',
 'Lynn_Abbey',
 'Belle_Kendrick_Abbott',
 'Eleanor_Hallowell_Abbott',
 'Hailey_Abbott',
 'Megan_Abbott',
 'Shana_Ab%C3%A9',
 'Louise_Abeita']
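
To see what the pattern is doing, here is a small self-contained sketch run on a made-up HTML fragment (the fragment and the variable names sample_html, sample_links, and decoded are illustrative, not taken from the real page). It also shows that percent-encoded names such as 'Shana_Ab%C3%A9' can be decoded with urllib.parse.unquote from the standard library if you want a readable name:

```python
import re
from urllib.parse import unquote

# A made-up fragment imitating the structure of the list page (illustrative only).
sample_html = (
    '<p>See <a href="/wiki/Novel" title="Novel">novel</a>.</p>'
    '<ul>'
    '<li><a href="/wiki/Edward_Abbey" title="Edward Abbey">Edward Abbey</a>, '
    '<i><a href="/wiki/Desert_Solitaire">Desert Solitaire</a></i></li>'
    '<li><a href="/wiki/Shana_Ab%C3%A9" title="Shana Abé">Shana Abé</a></li>'
    '</ul>'
)

# Only links that come directly after <li> are captured; the capture group
# keeps everything after /wiki/ up to the closing double quote.
sample_links = re.findall('<li><a href="/wiki/([^"]+)"', sample_html)
print(sample_links)

# Percent-encoded names can be decoded for display.
decoded = [unquote(link) for link in sample_links]
print(decoded)
```

The link to the novel is skipped because it follows an <i> tag rather than <li>, which is exactly why the trick works on the real page.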

The list of authors includes over 1600 pages.

In [5]:
len(authors)
Out[5]:
1612

Notice that the list includes a few links at the bottom that we do not actually want in our data.

In [6]:
authors[-40:]
Out[6]:
['Anzia_Yezierska',
 'Rafael_Yglesias',
 'Mako_Yoshikawa',
 'Al_Young',
 'Stark_Young',
 'Michele_Young-Stone',
 'Rafi_Zabor',
 'Roger_Zelazny',
 'Paul_Zindel',
 'Nell_Zink',
 'Leane_Zugsmith',
 'American_literature',
 'Colonial_American_literature',
 'Southern_literature',
 'African_American_literature',
 'Jewish_American_literature',
 'LGBT_literature',
 'Lists_of_writers',
 'List_of_short_story_authors',
 'List_of_novelists_by_nationality',
 'List_of_women_writers',
 'List_of_African-American_writers',
 'List_of_Asian-American_writers',
 'List_of_Jewish_American_authors',
 'List_of_writers_from_peoples_indigenous_to_the_Americas',
 'Pulitzer_Prize_for_the_Novel',
 'Pulitzer_Prize_for_Fiction',
 'National_Book_Award',
 'Bestseller',
 'Publishers_Weekly_lists_of_bestselling_novels_in_the_United_States',
 'List_of_Australian_novelists',
 'List_of_English_novelists',
 'List_of_French_novelists',
 'List_of_Ghanaian_novelists',
 'List_of_Irish_novelists',
 'List_of_Korean_novelists',
 'List_of_Nigerian_novelists',
 'List_of_Polish_novelists',
 'List_of_Portuguese_novelists',
 'List_of_Scottish_novelists']

We will cut off the list of authors manually with the following code (similar to how I cut out the header and footer of the raw HTML code in Tutorial 6).

In [7]:
authors = authors[:(authors.index('Leane_Zugsmith') + 1)]
authors[-10:]
Out[7]:
['Rafael_Yglesias',
 'Mako_Yoshikawa',
 'Al_Young',
 'Stark_Young',
 'Michele_Young-Stone',
 'Rafi_Zabor',
 'Roger_Zelazny',
 'Paul_Zindel',
 'Nell_Zink',
 'Leane_Zugsmith']
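
If the slicing step is unclear, the same idea on a toy list may help (the list names and the variables last and trimmed are made up for this sketch):

```python
# Toy list standing in for `authors`; the last two entries are unwanted links.
names = ['Paul_Zindel', 'Nell_Zink', 'Leane_Zugsmith',
         'American_literature', 'Bestseller']

# index() returns the position of the first match. Adding 1 makes the slice
# include the final author, because the end of a slice is exclusive.
last = names.index('Leane_Zugsmith')
trimmed = names[:last + 1]
print(trimmed)
```

Note that index() raises a ValueError if the name is not in the list, so this approach needs updating if the last author on the Wikipedia page ever changes.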

Now that we have our list of authors, let's grab them all (or verify that we have all of the links already).

In [8]:
for link in authors:
    wiki.get_wiki_json(link)
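
The parenthetical above suggests that pages already downloaded do not need to be fetched again. As a rough illustration of that general pattern (this is a sketch and not the actual contents of wiki.py; the names fetch_cached and fake_fetch are made up here), a function can check for a saved file before doing any work:

```python
import json
import tempfile
from pathlib import Path


def fetch_cached(name, fetch, cache_dir):
    """Return page data for `name`, calling `fetch` only if no cached file exists."""
    path = Path(cache_dir) / (name + ".json")
    if path.exists():
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch(name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


# Stand-in for a real download function; it records how often it is called.
calls = []


def fake_fetch(name):
    calls.append(name)
    return {"title": name.replace("_", " ")}


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("Mark_Twain", fake_fetch, tmp)
    second = fetch_cached("Mark_Twain", fake_fetch, tmp)

# The second call is answered from disk, so fake_fetch runs only once.
print(first == second, len(calls))
```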

Page metrics

Now that we have the pages for each of the authors, we want to gather a number of metrics about each page. To write code that does this, typically I start by playing around with a single page and then wrap it all up in a for loop.

For example, I wrote and tested the following code to figure out several metrics of interest:

In [9]:
# load a single page of data
data = wiki.get_wiki_json("Mark_Twain")
In [10]:
# (1) get the title of the page
data['title']
Out[10]:
'Mark Twain'
In [11]:
# (2) determine the number of links to other languages
len(data['langlinks'])
Out[11]:
140
In [12]:
# (3) determine number of internal links
len(data['links'])
Out[12]:
733
In [13]:
# (4) determine number of characters in the text of the page
len(data['text']['*'])
Out[13]:
345621
In [14]:
# (5) determine number of external links
len(data['externallinks'])
Out[14]:
144

You can use similar code to compute other metrics, such as the number of images used on the page.
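
As a hedged sketch of what that might look like, the block below uses a hand-made dictionary in place of a real page. The 'images' key is an assumption about what the page data contains, so check data.keys() on a real page before relying on it:

```python
# A hand-made dictionary imitating the shape of the page data used above.
# The 'images' key is an assumption: check data.keys() on a real page first.
page = {
    'title': 'Example Author',
    'langlinks': [{'lang': 'de'}, {'lang': 'fr'}],
    'links': [{'*': 'Novel'}, {'*': 'Poetry'}, {'*': 'Essay'}],
    'images': ['Portrait.jpg'],
    'text': {'*': '<p>Example text</p>'},
}

# .get() with an empty-list default avoids a KeyError when a key is missing.
metrics = {
    'num_langs': len(page.get('langlinks', [])),
    'num_links': len(page.get('links', [])),
    'num_images': len(page.get('images', [])),
    'num_elinks': len(page.get('externallinks', [])),
    'num_chars': len(page['text']['*']),
}
print(metrics)
```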

Aggregating metrics

Now, we will use a for loop to collect the metadata and metrics described in the prior section for each Wikipedia page in our corpus. We cycle through each author, appending the new metric values to the lists at the top of the code block. Fill in the information inside of the for loop to append the metrics for each page.

In [15]:
author_name = []
num_langs = []
num_links = []
num_chars = []
num_elinks = []

for link in authors:
    data = wiki.get_wiki_json(link)
    
    author_name.append(data['title'])
    num_langs.append(len(data['langlinks']))
    num_links.append(len(data['links']))
    num_chars.append(len(data['text']['*']))
    num_elinks.append(len(data['externallinks']))

Now, it will be useful to put all of this data together in a single table. The standard library for working with tabular data in Python is called pandas, which we import here:

In [16]:
import pandas as pd

The object that stores tabular data in pandas is called a DataFrame (yes, it's based on the data frame object native to R). There are many ways to build a data frame object from a collection of lists, but the block below illustrates my favorite method, using an OrderedDict. Below, I'll print out a copy of the table (notice that it prints nicely in the Jupyter notebook).

In [17]:
import collections

df = collections.OrderedDict()
df['author_name'] = author_name
df['url'] = authors
df['num_langs'] = num_langs
df['num_links'] = num_links
df['num_chars'] = num_chars
df['num_elinks'] = num_elinks

df = pd.DataFrame(df)
df
Out[17]:
author_name url num_langs num_links num_chars num_elinks
0 Patricia Aakhus Patricia_Aakhus 4 18 16048 11
1 Atia Abawi Atia_Abawi 1 49 29361 25
2 Edward Abbey Edward_Abbey 14 179 129965 71
3 Lynn Abbey Lynn_Abbey 9 339 72752 22
4 Belle Kendrick Abbott Belle_Kendrick_Abbott 0 22 10906 8
5 Eleanor Hallowell Abbott Eleanor_Hallowell_Abbott 10 133 41495 15
6 Hailey Abbott Hailey_Abbott 0 19 9818 10
7 Megan Abbott Megan_Abbott 10 47 31586 25
8 Shana Abé Shana_Ab%C3%A9 1 17 15169 11
9 Louise Abeita Louise_Abeita 0 27 17743 7
10 Robert H. Abel Robert_H._Abel 1 26 23803 17
11 Aberjhani Aberjhani 1 106 56226 33
12 Walter Abish Walter_Abish 9 63 22916 17
13 Abiola Abrams Abiola_Abrams 1 78 39768 33
14 Diana Abu-Jaber Diana_Abu-Jaber 1 50 18957 17
15 Susan Abulhawa Susan_Abulhawa 10 37 35559 34
16 Kathy Acker Kathy_Acker 18 173 95423 72
17 Cherry Adair Cherry_Adair 1 26 31689 21
18 Alice Adams (writer) Alice_Adams_(writer) 1 35 25544 16
19 Henry Brooks Adams Henry_Brooks_Adams 0 1 958 0
20 Yda Addis Yda_Addis 0 1 972 0
21 Kim Addonizio Kim_Addonizio 3 34 34623 36
22 George Ade George_Ade 8 64 40080 23
23 Renata Adler Renata_Adler 4 86 45092 27
24 Warren Adler Warren_Adler 6 58 38336 25
25 James Agee James_Agee 26 310 75178 42
26 Charlotte Agell Charlotte_Agell 2 39 87868 17
27 Kelli Russell Agodon Kelli_Russell_Agodon 0 15 24826 27
28 Conrad Aiken Conrad_Aiken 26 166 60924 49
29 Hiag Akmakjian Hiag_Akmakjian 2 30 30156 19
... ... ... ... ... ... ...
1553 Samuel Woodworth Samuel_Woodworth 0 39 24215 19
1554 Cornell Woolrich Cornell_Woolrich 18 157 47176 25
1555 Constance Fenimore Woolson Constance_Fenimore_Woolson 7 69 74045 24
1556 Herman Wouk Herman_Wouk 28 343 93201 48
1557 Austin Tappan Wright Austin_Tappan_Wright 1 58 25321 12
1558 Ernest Vincent Wright Ernest_Vincent_Wright 7 16 17895 12
1559 Harold Bell Wright Harold_Bell_Wright 1 84 45844 25
1560 Kirby Wright Kirby_Wright 0 61 23988 10
1561 Mary Tappan Wright Mary_Tappan_Wright 1 49 55425 65
1562 Richard Wright (author) Richard_Wright_(author) 27 195 117945 63
1563 Stephen Wright (writer) Stephen_Wright_(writer) 0 30 14062 10
1564 Xu Xi (writer) Xu_Xi_(writer) 1 38 27696 23
1565 Irvin D. Yalom Irvin_D._Yalom 24 258 62451 31
1566 Lois-Ann Yamanaka Lois-Ann_Yamanaka 0 54 26022 17
1567 Karen Tei Yamashita Karen_Tei_Yamashita 1 40 24342 23
1568 Chelsea Quinn Yarbro Chelsea_Quinn_Yarbro 3 133 39965 28
1569 Steve Yarbrough (writer) Steve_Yarbrough_(writer) 0 32 12918 13
1570 Richard Yates (novelist) Richard_Yates_(novelist) 15 132 49604 33
1571 Frank Yerby Frank_Yerby 3 49 33063 23
1572 Anzia Yezierska Anzia_Yezierska 5 68 43655 39
1573 Rafael Yglesias Rafael_Yglesias 2 66 21878 17
1574 Mako Yoshikawa Mako_Yoshikawa 1 22 14999 14
1575 Al Young Al_Young 0 71 30751 32
1576 Stark Young Stark_Young 0 41 17200 13
1577 Michele Young-Stone Michele_Young-Stone 0 19 14284 12
1578 Rafi Zabor Rafi_Zabor 1 34 18731 17
1579 Roger Zelazny Roger_Zelazny 35 177 74495 32
1580 Paul Zindel Paul_Zindel 5 190 51223 22
1581 Nell Zink Nell_Zink 2 52 43856 30
1582 Leane Zugsmith Leane_Zugsmith 0 33 19704 13

1583 rows × 6 columns
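
As one alternative to the OrderedDict approach, here is a minimal sketch that builds a small table from a plain dict, using three rows copied from the table above (the variable names small, names, langs, and links are made up for this example). Since Python 3.7 a plain dict keeps its insertion order, so the columns come out in the order written:

```python
import pandas as pd

# Three rows copied from the table above, stored as parallel lists.
names = ['Patricia Aakhus', 'Atia Abawi', 'Edward Abbey']
langs = [4, 1, 14]
links = [18, 49, 179]

# A plain dict keeps insertion order, so the columns appear in this order.
small = pd.DataFrame({
    'author_name': names,
    'num_langs': langs,
    'num_links': links,
})
print(small.shape)
print(list(small.columns))
```

All of the lists must have the same length, otherwise pandas raises a ValueError when building the table.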

Pandas has a convenient method for storing a table of data as a CSV (comma-separated values) file. Running the code below will save the table as the file "american_authors.csv"; it can be read into programs such as Excel and Google Sheets, as well as other programming languages.

In [18]:
df.to_csv("american_authors.csv", index=False)

If you open the file browser, you'll see the CSV file show up in your 'tutorials' directory. You can similarly read a CSV file back into Python using the pd.read_csv function.

In [19]:
new_df = pd.read_csv("american_authors.csv")
new_df
Out[19]:
author_name url num_langs num_links num_chars num_elinks
0 Patricia Aakhus Patricia_Aakhus 4 18 16048 11
1 Atia Abawi Atia_Abawi 1 49 29361 25
2 Edward Abbey Edward_Abbey 14 179 129965 71
3 Lynn Abbey Lynn_Abbey 9 339 72752 22
4 Belle Kendrick Abbott Belle_Kendrick_Abbott 0 22 10906 8
5 Eleanor Hallowell Abbott Eleanor_Hallowell_Abbott 10 133 41495 15
6 Hailey Abbott Hailey_Abbott 0 19 9818 10
7 Megan Abbott Megan_Abbott 10 47 31586 25
8 Shana Abé Shana_Ab%C3%A9 1 17 15169 11
9 Louise Abeita Louise_Abeita 0 27 17743 7
10 Robert H. Abel Robert_H._Abel 1 26 23803 17
11 Aberjhani Aberjhani 1 106 56226 33
12 Walter Abish Walter_Abish 9 63 22916 17
13 Abiola Abrams Abiola_Abrams 1 78 39768 33
14 Diana Abu-Jaber Diana_Abu-Jaber 1 50 18957 17
15 Susan Abulhawa Susan_Abulhawa 10 37 35559 34
16 Kathy Acker Kathy_Acker 18 173 95423 72
17 Cherry Adair Cherry_Adair 1 26 31689 21
18 Alice Adams (writer) Alice_Adams_(writer) 1 35 25544 16
19 Henry Brooks Adams Henry_Brooks_Adams 0 1 958 0
20 Yda Addis Yda_Addis 0 1 972 0
21 Kim Addonizio Kim_Addonizio 3 34 34623 36
22 George Ade George_Ade 8 64 40080 23
23 Renata Adler Renata_Adler 4 86 45092 27
24 Warren Adler Warren_Adler 6 58 38336 25
25 James Agee James_Agee 26 310 75178 42
26 Charlotte Agell Charlotte_Agell 2 39 87868 17
27 Kelli Russell Agodon Kelli_Russell_Agodon 0 15 24826 27
28 Conrad Aiken Conrad_Aiken 26 166 60924 49
29 Hiag Akmakjian Hiag_Akmakjian 2 30 30156 19
... ... ... ... ... ... ...
1553 Samuel Woodworth Samuel_Woodworth 0 39 24215 19
1554 Cornell Woolrich Cornell_Woolrich 18 157 47176 25
1555 Constance Fenimore Woolson Constance_Fenimore_Woolson 7 69 74045 24
1556 Herman Wouk Herman_Wouk 28 343 93201 48
1557 Austin Tappan Wright Austin_Tappan_Wright 1 58 25321 12
1558 Ernest Vincent Wright Ernest_Vincent_Wright 7 16 17895 12
1559 Harold Bell Wright Harold_Bell_Wright 1 84 45844 25
1560 Kirby Wright Kirby_Wright 0 61 23988 10
1561 Mary Tappan Wright Mary_Tappan_Wright 1 49 55425 65
1562 Richard Wright (author) Richard_Wright_(author) 27 195 117945 63
1563 Stephen Wright (writer) Stephen_Wright_(writer) 0 30 14062 10
1564 Xu Xi (writer) Xu_Xi_(writer) 1 38 27696 23
1565 Irvin D. Yalom Irvin_D._Yalom 24 258 62451 31
1566 Lois-Ann Yamanaka Lois-Ann_Yamanaka 0 54 26022 17
1567 Karen Tei Yamashita Karen_Tei_Yamashita 1 40 24342 23
1568 Chelsea Quinn Yarbro Chelsea_Quinn_Yarbro 3 133 39965 28
1569 Steve Yarbrough (writer) Steve_Yarbrough_(writer) 0 32 12918 13
1570 Richard Yates (novelist) Richard_Yates_(novelist) 15 132 49604 33
1571 Frank Yerby Frank_Yerby 3 49 33063 23
1572 Anzia Yezierska Anzia_Yezierska 5 68 43655 39
1573 Rafael Yglesias Rafael_Yglesias 2 66 21878 17
1574 Mako Yoshikawa Mako_Yoshikawa 1 22 14999 14
1575 Al Young Al_Young 0 71 30751 32
1576 Stark Young Stark_Young 0 41 17200 13
1577 Michele Young-Stone Michele_Young-Stone 0 19 14284 12
1578 Rafi Zabor Rafi_Zabor 1 34 18731 17
1579 Roger Zelazny Roger_Zelazny 35 177 74495 32
1580 Paul Zindel Paul_Zindel 5 190 51223 22
1581 Nell Zink Nell_Zink 2 52 43856 30
1582 Leane Zugsmith Leane_Zugsmith 0 33 19704 13

1583 rows × 6 columns
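
To convince yourself that nothing is lost in the round trip, here is a small sketch that writes a two-row table to an in-memory buffer instead of a file and reads it back (the names small, buffer, csv_text, and restored are made up for this example):

```python
import io

import pandas as pd

small = pd.DataFrame({'author_name': ['Shana Abé', 'Kathy Acker'],
                      'num_langs': [1, 18]})

# Write to an in-memory text buffer instead of a file on disk.
# index=False leaves out the row numbers, as in the code above.
buffer = io.StringIO()
small.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should give an equal table.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(small))
```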

Plotting data

Another useful feature of the pandas library is that it makes it easy to produce plots of the data stored within a table. Here is some example code for producing a scatter plot from our pandas dataset:

In [20]:
%matplotlib inline
import matplotlib
In [21]:
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
In [22]:
df.plot.scatter(x='num_langs',
                y='num_links',
                c='num_chars',
                colormap='viridis')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x116422a58>

You can modify the figure.figsize parameter based on the size of your computer screen.
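
A plot is one way to look at the relationship between the metrics; a numeric summary is another. As a sketch (not part of the original analysis), the block below computes pairwise Pearson correlations on six rows copied from the table above. With the full dataset you would select the same four columns of df and call .corr() on them:

```python
import pandas as pd

# Six rows copied from the table above; the name `sample` is made up here.
sample = pd.DataFrame({
    'num_langs':  [4, 1, 14, 9, 0, 10],
    'num_links':  [18, 49, 179, 339, 22, 133],
    'num_chars':  [16048, 29361, 129965, 72752, 10906, 41495],
    'num_elinks': [11, 25, 71, 22, 8, 15],
})

# corr() returns a square table of Pearson correlations between columns.
corr = sample.corr()
print(corr.round(2))
```

Six rows are far too few to draw conclusions from; the point is only to show the shape of the result, a 4 x 4 table with ones on the diagonal.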

Interactive plotting with Bokeh

The static plot above is okay, but it's much more interesting to create an interactive graphic. To do this we will use the bokeh module. Load the required functions from bokeh below and specify that the output should appear within the Jupyter notebook.

In [23]:
from bokeh.plotting import figure, show, output_notebook, ColumnDataSource

output_notebook()
Loading BokehJS ...

Now, the code block below produces an interactive scatter plot. You can pan and zoom the plot depending on what parts of the plot you find interesting. Also, perhaps most importantly, if you hover over a point the name of the author associated with the point will show up. Try it!

In [24]:
TOOLTIPS = [
    ("Author", "@author_name"),
    ("Number Internal Links", "@num_links"),
    ("Number External Links", "@num_elinks"),
]

p = figure(plot_width=950,
           plot_height=500,
           tooltips=TOOLTIPS,
           tools="hover,pan,wheel_zoom,reset",
           toolbar_location="below",
           toolbar_sticky=True,
           active_scroll='wheel_zoom',
           title="American Authors - Wikipedia Data",
           x_axis_label="Number of Language Pages",
           y_axis_label="Number of Internal Links")

p.circle(x='num_langs',
         y='num_links',
         size=10,
         fill_alpha=0.5,
         source=ColumnDataSource(data=df))

show(p)

You won't understand all of the components of the plot immediately, but hopefully the example shows enough so that you could modify the plot to include different variables or a different set of information when hovering over the points.
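
For instance, to change what appears when hovering, only the TOOLTIPS list needs to change: each entry pairs a label with "@" followed by a column name from the table. The helper below is a made-up convenience function (make_tooltips is not part of bokeh) that builds such a list from a dict of labels and column names:

```python
def make_tooltips(columns):
    """Build a Bokeh-style tooltip list from {label: column_name} pairs."""
    # "@name" is replaced by the value of column `name` for the hovered point.
    return [(label, "@" + column) for label, column in columns.items()]


TOOLTIPS = make_tooltips({
    "Author": "author_name",
    "Number of Characters": "num_chars",
    "Number of Languages": "num_langs",
})
print(TOOLTIPS)
```

To plot different variables, change the x and y arguments of p.circle to other column names in df and update the axis labels to match.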

Finally, the plot below makes use of the OpenURL and TapTool models to make the points clickable. Tapping on a point will open the Wikipedia page in a new tab. Try it now!

In [25]:
from bokeh.models import OpenURL, TapTool

TOOLTIPS = [
    ("Author", "@author_name"),
    ("Number Internal Links", "@num_links"),
    ("Number External Links", "@num_elinks"),
]

p = figure(plot_width=950,
           plot_height=500,
           tooltips=TOOLTIPS,
           tools="hover,pan,wheel_zoom,reset,tap",
           toolbar_location="below",
           toolbar_sticky=True,
           active_scroll='wheel_zoom',
           title="American Authors - Wikipedia Data",
           x_axis_label="Number of Language Pages",
           y_axis_label="Number of Internal Links")

p.circle(x='num_langs',
         y='num_links',
         size=10,
         fill_alpha=0.5,
         source=ColumnDataSource(data=df))

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url="https://en.wikipedia.org/wiki/@url")

show(p)

Rather than a formal practice set of questions, today you will start working on your first project, which uses the methods developed in this tutorial to analyze a new dataset.

More practice

Feel free to check out the bokeh reference guide (note: it's huge!):

In particular, check out the Gallery and demos. If you are interested in data visualization, there will be several chances to build out interesting bokeh-based applications later this semester.