Tutorial 14: Networks and Algorithms

In this tutorial we use the networkx module to work with network/graph objects in Python. To start, read in the modules and get the matplotlib graphics engine running properly (if you have a smaller screen, feel free to adjust the size of the plots).

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
In [2]:
matplotlib.rcParams['figure.figsize'] = (17.0, 9.0)

Graph objects

We'll start by creating and empty graph object G. Notice that printing out the object does not show anything particularly interesting.

In [3]:
G = nx.Graph()
G
Out[3]:
<networkx.classes.graph.Graph at 0xa20239cf8>

We use the method (a function attached to an object) add_edge to add things to the graph. The syntax involves giving the names of two connected edges. Here, I'll make a graph of the three (non-island) countries in North America and connect those that share a border.

In [4]:
G.add_edge("United States", "Canada")
G.add_edge("United States", "Mexico")

Nodes are created as needed by the add_edge function. We can see them using the nodes attribute of G:

In [5]:
G.nodes
Out[5]:
NodeView(('United States', 'Canada', 'Mexico'))

Similarly, the edges attribute shows the two edges in our graph.

In [6]:
G.edges
Out[6]:
EdgeView([('United States', 'Canada'), ('United States', 'Mexico')])

Plotting the network

We can plot a simple version of the graph using one of several plotting methods. Here, I'll use the draw_spring function, which should be good enough for most use cases.

In [7]:
nx.draw_spring(G)

Note: Run the code above a few times and/or look at your neighbors plot. You should see that the results are not stable. The algorithm is stochastic and based on a physics simulation. The general shape should be similar but the exact output changes each time the code is run.

We can make our plot more interesting using the with_labels option.

In [8]:
nx.draw_spring(G, with_labels=True)

A kite graph

Now, let's create a slightly more complex network. This is known as the 'kite graph' given its shape. It is a simplified version of the kite graph from my notes.

In [9]:
G = nx.Graph()
G.add_edge("A", "B")
G.add_edge("B", "C")
G.add_edge("A", "C")
G.add_edge("C", "D")
G.add_edge("A", "D")
G.add_edge("D", "E")

nx.draw_spring(G, with_labels=True)

Centrality

Plotting networks is a nice way to start understanding what's going on, but typically more useful in an analysis will be extracting network metrics. For example, we can associate with each node a centrality score.

Eigenvalue centrality scores for each node, $x_j$, are defined through a recursive relationship. Specifically, the eigenvalue centrality of a node is proportional to the sum of its neighbor's, $N(i)$, scores:

$$ x_j = \lambda \cdot \sum_{i \in N(j)} x_i $$

The value of $\lambda$ is an eigenvalue, hence the name (the details are not important right now). We can compute these with the eigenvector_centrality_numpy function:

In [10]:
nx.eigenvector_centrality_numpy(G)
Out[10]:
{'A': 0.5370769634854428,
 'B': 0.40669371006285376,
 'C': 0.5370769634854425,
 'D': 0.47474973908441287,
 'E': 0.17974866347516058}

For this small example, its easy to see that 'A' and 'C' have the highest score and 'E' the smallest. We can automate this using the code below:

In [11]:
import operator

cent = nx.eigenvector_centrality_numpy(G)
sorted_x = sorted(cent.items(), key=operator.itemgetter(1), reverse=True)
sorted_x
Out[11]:
[('C', 0.5370769634854425),
 ('A', 0.5370769634854424),
 ('D', 0.47474973908441304),
 ('B', 0.406693710062854),
 ('E', 0.1797486634751605)]

This will be useful for larger networks. Another, simplier, metric for determining the centrality of a node is the "degree centrality". It's proportional to the number of neighbors a node has.

In [12]:
cent = nx.degree_centrality(G)
sorted_x = sorted(cent.items(), key=operator.itemgetter(1), reverse=True)
sorted_x
Out[12]:
[('A', 0.75), ('C', 0.75), ('D', 0.75), ('B', 0.5), ('E', 0.25)]

There are a few other centrality metrics that we saw in the notes, such as betweeness, but I find computing them in networkx prohibitively slow.

Communities

Another algorithm we can apply to a graph determines what the communities are within the network. Roughly, a community is a set of nodes that are more connected with each other than they are with the remainer of the plot.

Here, we use the asyn_fluidc function to split the graph into 3 communities. Note that this algorithm is stochastic and you may have different results each time you run the code.

In [13]:
from networkx.algorithms.community.asyn_fluidc import asyn_fluidc

communities = list(asyn_fluidc(G, 3))
communities
Out[13]:
[{'C'}, {'A', 'B'}, {'D', 'E'}]

While we have this small example, let' see how to inverse this process by assigning each node to its cluster.

In [14]:
communities_id = {}
for idx, val in enumerate(communities):
    for k in val:
        communities_id[k] = idx
        
communities_id
Out[14]:
{'C': 0, 'A': 1, 'B': 1, 'D': 2, 'E': 2}
In [15]:
import pandas as pd
df = pd.DataFrame(dict(title=list(communities_id.keys()),
                       community=list(communities_id.values())))
df = df.sort_values('title')
df = df.reset_index(drop=True)
df
Out[15]:
title community
0 A 1
1 B 1
2 C 0
3 D 2
4 E 2

This will be useful in our next set of notes.

Wikipedia graphs

Now, let's build some more interesting graphs using our Wikipedia corpus. Start by loading in the iplot and wiki modules.

In [16]:
import iplot
import wiki

assert iplot.__version__ >= 1
assert wiki.__version__ >= 3
Loading BokehJS ...

We'll work with the Birthday Cake dataset first. Here is the data to get all of the pages linked to from the seed page:

In [17]:
data_json = wiki.get_wiki_json("Birthday_cake")
internal_links = wiki.links_as_list(data_json) + ['Birthday_cake']
internal_links[:10]
Out[17]:
['Atlantic magazine',
 'The Independent',
 'Tteok',
 'Ancient Rome',
 'Angel cake',
 'Angel food cake',
 'Apple cake',
 'Avocado cake',
 'Babka (cake)',
 'Banana bread']

Now, we want to create a graph object where the nodes are the pages stored in internal_links and edges connect any two pages that link to one another. How would we go about doing this? We will need a for loop to cycle through the pages. But, on each page we also need to cycle through all of internal links on that page. To do this, we'll need nested loops. That is, we have one loop inside of another.

Here is an example of how a nested for loop works. To show this, I'll create an object A that consists of a "list of lists":

In [18]:
A = [['cow', 'chicken', 'duck'], ['banana', 'apple'], ['Virginia', 'Maryland']]

for items in A:
    for x in items:
        print(x + ' - ' + str(len(x)))
cow - 3
chicken - 7
duck - 4
banana - 6
apple - 5
Virginia - 8
Maryland - 8

So, there isn't too much to the nested for loops. Now, try to use everything we've learned to fix the code below to create a graph of all the internal links on the pages in the list internal_links. (Note: the resulting graph is probably not exactly what we need, more on this below).

In [19]:
G = nx.Graph()

for link in internal_links:
    data = wiki.get_wiki_json(link)
    for new_link in data['links']:
        G.add_edge(link, new_link['*'])

Now, how many edges are there in the graph?

In [20]:
len(G.edges)
Out[20]:
48136

You should have thousands of edges. If you only have a few hundred, you didn't do the nested loops correctly.

Let's now see how many nodes there are in the graph:

In [21]:
len(G.nodes)
Out[21]:
17051

Compare this with the number of internal links:

In [22]:
len(internal_links)
Out[22]:
203

Unless you did something wrong (or very clever) above, you should see that there are over 15k nodes in the graph even though there are 203 internal links.

Question: Why is there this discrepency? (and if you somehow took care of the issue already, try to think of what problem would cause someone to have too many nodes)

Answer: We are accidentally including all internal links, not just links to pages in our original set internal_links. We need to make sure that only these edges are included in the graph.

Testing inclusion

I'll give you the answer to the question directly above: you probably included all of the internal links on all of the pages. This would include, for example, any link on any page linked to from the Birthday Cake page. We don't want all of these links. Instead, we only want links where the starting and ending node are in our set internal_links.

Here, I'll walk you through two different ways to do this in Python using one page as an example. You'll then work one of these methods into your nested loop.

Let's take the 18th internal link, which should be the 'Birthday':

In [23]:
link = internal_links[17]
link
Out[23]:
'Birthday'

I'll grab the JSON file for this page and extract its internal links:

In [24]:
data = wiki.get_wiki_json(link)
new_links = wiki.links_as_list(data)
new_links[:25]
Out[24]:
['ABC News',
 'Age of consent',
 'Age of majority',
 'Age of marriage',
 'Anniversary',
 'Anzac Day',
 'Australia',
 'Bar mitzvah',
 'Bat mitzvah',
 'Birth',
 'Birth certificate',
 'Birthday (disambiguation)',
 'Birthday attack',
 'Birthday card',
 'Birthday paradox',
 'Birthday party',
 'Birthstones',
 'British Royal Family',
 'Buddha',
 "Buddha's Birthday",
 "Buddha's birthday",
 'Caesarean section',
 'Cake',
 'Canada',
 'Canterbury Tales']

Many, perhaps even most, of these links are not in our set internal_links. We want to filter out only those that are. There are two ways to do this, and I will show you both.

The first method uses the function set to create a set object in Python. A set is similar to a list, but cannot contain any duplicate values and does not have any particular ordering. Sets in python allow for set arthimetic, so if we subtract one set from another it will remove elements in the second from the first. For example:

In [25]:
set(['a', 'b', 'c', 'd']) - set(['b', 'c'])
Out[25]:
{'a', 'd'}

Similarly, the intersection method of a set find those elements that are common between to sets:

In [26]:
set(['a', 'b', 'c', 'd']).intersection(set(['b', 'c']))
Out[26]:
{'b', 'c'}

So, what links from the Independent site are included in our list? Here we see this using sets:

In [27]:
list(set(new_links).intersection(set(internal_links)))
Out[27]:
['Cake', 'Happy Birthday to You', 'Rite of passage']

You should see that it have just three links to other pages in our set.

The other way to solve this problem is to test whether a specific link is in our list internal_links. This can be done with the function in for a particular element:

In [28]:
'Cake' in internal_links
Out[28]:
True
In [29]:
'Age of marriage' in internal_links
Out[29]:
False

How do we make use of this? We can use an if statement which executes its code only if a statement is True. So:

In [30]:
for link in new_links:
    if link in internal_links:
        print(link)
Cake
Happy Birthday to You
Rite of passage

Either way works for this application. The sets are a bit cleaner, but we'll need if statements in other applications so it's useful to see how they work here.

Wikipedia graphs II

Now that we know how to only include the correct edges, here we will write code here to create the correct graph. It will be more interesting to work with the Richmond dataset in our analysis, so let's use that one now.

In [31]:
data_json = wiki.get_wiki_json("Richmond, Virginia")
internal_links = wiki.links_as_list(data_json) + ['Richmond, Virginia']

Now, write the code for constructing the graph here:

In [32]:
G = nx.Graph()

for link in internal_links:
    data = wiki.get_wiki_json(link)
    for new_link in data['links']:
        if new_link['*'] in internal_links:
            G.add_edge(link, new_link['*'])

When you're done, test that it has as many nodes as there are internal links.

In [33]:
print(len(G.nodes))
print(len(internal_links))
1130
1131

Note: There is one more internal link because we have a duplicate in the dataset. I manually added a link to 'Richmond, Virgina', but the page already cites itself. This will not effect our analysis.

Analysis

Now, finally, let's actually analyze our data using the networkx module. To start, try to plot the network (it's possible that this will be too slow on your machine, just try it out anyway).

In [34]:
nx.draw_spring(G, with_labels=True)