In this tutorial we use the **networkx** module to work with network/graph
objects in Python. To start, read in the modules and get the matplotlib graphics
engine running properly (if you have a smaller screen, feel free to adjust the
size of the plots).

In [1]:

```
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
```

In [2]:

```
matplotlib.rcParams['figure.figsize'] = (17.0, 9.0)
```

We'll start by creating an empty graph object `G`. Notice that printing out the
object does not show anything particularly interesting.

In [3]:

```
G = nx.Graph()
G
```

Out[3]:

We use the method (a function attached to an object) `add_edge` to add things
to the graph. The syntax involves giving the names of the two nodes being
connected. Here, I'll make a graph of the three (non-island) countries in North
America and connect those that share a border.

In [4]:

```
G.add_edge("United States", "Canada")
G.add_edge("United States", "Mexico")
```

Nodes are created as needed by the `add_edge` function. We can see them using
the `nodes` attribute of `G`:

In [5]:

```
G.nodes
```

Out[5]:

Similarly, the `edges` attribute shows the two edges in our graph.

In [6]:

```
G.edges
```

Out[6]:

We can plot a simple version of the graph using one of several plotting methods.
Here, I'll use the `draw_spring` function, which should be good enough for most
use cases.

In [7]:

```
nx.draw_spring(G)
```

**Note: Run the code above a few times and/or look at your neighbor's
plot. You should see that the results are not stable.** The algorithm is
stochastic and based on a physics simulation. The general shape should be
similar, but the exact output changes each time the code is run.
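
If you want a reproducible picture, you can compute the layout yourself and pass it to a drawing function; `spring_layout` accepts a `seed` argument in recent versions of **networkx**. A minimal sketch, reusing the small country graph from above:

```python
import networkx as nx

# The same small graph built earlier in this tutorial
G = nx.Graph()
G.add_edge("United States", "Canada")
G.add_edge("United States", "Mexico")

# Fixing the seed makes the physics simulation deterministic,
# so the same positions come back on every run
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True)
```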

We can make our plot more interesting using the `with_labels` option.

In [8]:

```
nx.draw_spring(G, with_labels=True)
```

Now, let's create a slightly more complex network. This is known as the 'kite graph' given its shape. It is a simplified version of the kite graph from my notes.

In [9]:

```
G = nx.Graph()
G.add_edge("A", "B")
G.add_edge("B", "C")
G.add_edge("A", "C")
G.add_edge("C", "D")
G.add_edge("A", "D")
G.add_edge("D", "E")
nx.draw_spring(G, with_labels=True)
```

Plotting networks is a nice way to start understanding what's going on, but typically more useful in an analysis will be extracting network metrics. For example, we can associate with each node a centrality score.

Eigenvector centrality scores for each node, $x_j$, are defined through a recursive relationship. Specifically, the eigenvector centrality of a node is proportional to the sum of the scores of its neighbors, $N(j)$:

$$ x_j = \frac{1}{\lambda} \cdot \sum_{i \in N(j)} x_i $$

The value of $\lambda$ is an eigenvalue, hence the name (the details are
not important right now). We can compute these with the
`eigenvector_centrality_numpy` function:

In [10]:

```
nx.eigenvector_centrality_numpy(G)
```

Out[10]:
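
For intuition, these scores can be approximated by hand with power iteration: start with equal scores, repeatedly replace each node's score with the sum of its neighbors' scores, and renormalize. A minimal pure-Python sketch for the kite graph above, with the adjacency written out by hand:

```python
# Adjacency list for the kite graph built above
adj = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'C'],
    'C': ['A', 'B', 'D'],
    'D': ['A', 'C', 'E'],
    'E': ['D'],
}

x = {node: 1.0 for node in adj}      # start with equal scores
for _ in range(100):                 # iterate until (approximately) converged
    new = {node: sum(x[nb] for nb in adj[node]) for node in adj}
    norm = sum(v * v for v in new.values()) ** 0.5
    x = {node: v / norm for node, v in new.items()}

print(x)  # 'A' and 'C' come out highest, 'E' lowest
```

The results should closely match what `eigenvector_centrality_numpy` returns.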

For this small example, it's easy to see that 'A' and 'C' have the highest scores and 'E' the smallest. We can automate this using the code below:

In [11]:

```
import operator
cent = nx.eigenvector_centrality_numpy(G)
sorted_x = sorted(cent.items(), key=operator.itemgetter(1), reverse=True)
sorted_x
```

Out[11]:
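
If you prefer not to import `operator`, the same sort can be written with a `lambda`; a small sketch with made-up centrality scores:

```python
# Made-up scores standing in for the output of a centrality function
cent = {'A': 0.5, 'B': 0.3, 'E': 0.1}

# Sort the (node, score) pairs by score, largest first
sorted_x = sorted(cent.items(), key=lambda item: item[1], reverse=True)
print(sorted_x)  # [('A', 0.5), ('B', 0.3), ('E', 0.1)]
```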

This will be useful for larger networks. Another, simpler, metric for determining the centrality of a node is the "degree centrality". It's proportional to the number of neighbors a node has.

In [12]:

```
cent = nx.degree_centrality(G)
sorted_x = sorted(cent.items(), key=operator.itemgetter(1), reverse=True)
sorted_x
```

Out[12]:
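
Degree centrality has a simple closed form: each node's degree divided by $n - 1$, where $n$ is the number of nodes. A minimal sketch reproducing it for the kite graph, with the adjacency written out by hand:

```python
# Adjacency list for the kite graph built above
adj = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'C'],
    'C': ['A', 'B', 'D'],
    'D': ['A', 'C', 'E'],
    'E': ['D'],
}

n = len(adj)
# Degree centrality: number of neighbors divided by (n - 1)
deg_cent = {node: len(neigh) / (n - 1) for node, neigh in adj.items()}
print(deg_cent)  # {'A': 0.75, 'B': 0.5, 'C': 0.75, 'D': 0.75, 'E': 0.25}
```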

There are a few other centrality metrics that we saw in the notes, such as betweenness,
but I find computing them in **networkx** prohibitively slow.

Another algorithm we can apply to a graph determines the communities within the network. Roughly, a community is a set of nodes that are more connected with each other than they are with the remainder of the graph.

Here, we use the `asyn_fluidc` function to split the graph into 3 communities.
Note that this algorithm is stochastic and you may have different results each
time you run the code.

In [13]:

```
from networkx.algorithms.community.asyn_fluidc import asyn_fluidc
communities = list(asyn_fluidc(G, 3))
communities
```

Out[13]:

While we have this small example, let's see how to invert this process by assigning each node to its cluster.

In [14]:

```
communities_id = {}
for idx, val in enumerate(communities):
    for k in val:
        communities_id[k] = idx
communities_id
```

Out[14]:
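
The same inversion can be written as a single dictionary comprehension; a sketch using a made-up community list of the same list-of-sets shape that `asyn_fluidc` returns:

```python
# Made-up example with the same shape as the output of asyn_fluidc
communities = [{'A', 'B'}, {'C', 'D'}, {'E'}]

# One comprehension replaces the nested for loop above
communities_id = {node: idx for idx, comm in enumerate(communities) for node in comm}
print(communities_id)  # {'A': 0, 'B': 0, 'C': 1, 'D': 1, 'E': 2}
```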

In [15]:

```
import pandas as pd
df = pd.DataFrame(dict(title=list(communities_id.keys()),
                       community=list(communities_id.values())))
df = df.sort_values('title')
df = df.reset_index(drop=True)
df
```

Out[15]:
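
As a quick sanity check on a frame like this, `value_counts` tallies how many nodes landed in each community; a sketch with made-up values matching the shape of `df`:

```python
import pandas as pd

# Made-up frame with the same columns as df above
df = pd.DataFrame({'title': ['A', 'B', 'C', 'D', 'E'],
                   'community': [0, 0, 1, 1, 2]})

# How many nodes fall in each community?
print(df['community'].value_counts().sort_index())
```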

This will be useful in our next set of notes.

Now, let's build some more interesting graphs using our Wikipedia corpus. Start
by loading in the `iplot` and `wiki` modules.

In [16]:

```
import iplot
import wiki
assert iplot.__version__ >= 1
assert wiki.__version__ >= 3
```

We'll work with the Birthday Cake dataset first. Here is the data to get all of the pages linked to from the seed page:

In [17]:

```
data_json = wiki.get_wiki_json("Birthday_cake")
internal_links = wiki.links_as_list(data_json) + ['Birthday_cake']
internal_links[:10]
```

Out[17]:

Now, we want to create a graph object where the nodes are the pages stored in
`internal_links` and edges connect any two pages that link to one another. How
would we go about doing this? We will need a `for` loop to cycle through the pages.
But, on each page we **also** need to cycle through all of the internal links on that
page. To do this, we'll need *nested loops*. That is, we have one loop inside of
another.

Here is an example of how a nested for loop works. To show this, I'll create an object
`A` that consists of a "list of lists":

In [18]:

```
A = [['cow', 'chicken', 'duck'], ['banana', 'apple'], ['Virginia', 'Maryland']]
for items in A:
    for x in items:
        print(x + ' - ' + str(len(x)))
```

So, there isn't too much to nested for loops. Now, try to use everything we've
learned to fix the code below to create a graph of all the internal links on the pages
in the list `internal_links`. (Note: the resulting graph is probably not exactly what
we need; more on this below.)

In [19]:

```
G = nx.Graph()
for link in internal_links:
    data = wiki.get_wiki_json(link)
    for new_link in data['links']:
        G.add_edge(link, new_link['*'])
```

Now, how many edges are there in the graph?

In [20]:

```
len(G.edges)
```

Out[20]:

You should have thousands of edges. If you only have a few hundred, you didn't do the nested loops correctly.

Let's now see how many nodes there are in the graph:

In [21]:

```
len(G.nodes)
```

Out[21]:

Compare this with the number of internal links:

In [22]:

```
len(internal_links)
```

Out[22]:

Unless you did something wrong (or very clever) above, you should see that there are over 15k nodes in the graph even though there are 203 internal links.

**Question: Why is there this discrepancy? (And if you somehow took care of the
issue already, try to think of what problem would cause someone to have too many
nodes.)**

**Answer:** We are accidentally including *all* internal links, not just links to pages in
our original set `internal_links`. We need to make sure that only these edges are included
in the graph.

I'll give you the answer to the question directly above: you probably included
all of the internal links on all of the pages. This would include, for example,
any link on any page linked to from the Birthday Cake page. We don't want all
of these links. Instead, we only want links where the starting and ending node
are in our set `internal_links`.

Here, I'll walk you through two different ways to do this in Python using one page as an example. You'll then work one of these methods into your nested loop.

Let's take the 18th internal link, which should be the 'Birthday' page:

In [23]:

```
link = internal_links[17]
link
```

Out[23]:

I'll grab the JSON file for this page and extract its internal links:

In [24]:

```
data = wiki.get_wiki_json(link)
new_links = wiki.links_as_list(data)
new_links[:25]
```

Out[24]:

Many, perhaps even most, of these links are **not** in our set `internal_links`.
We want to keep only those that are. There are two ways to do this, and I
will show you both.

The first method uses the function `set` to create a set object in Python. A set
is similar to a list, but cannot contain any duplicate values and does not have
any particular ordering. Sets in Python allow for set arithmetic, so if we subtract
one set from another it will remove elements of the second from the first. For
example:

In [25]:

```
set(['a', 'b', 'c', 'd']) - set(['b', 'c'])
```

Out[25]:

Similarly, the `intersection` method of a set finds those elements that are common
between two sets:

In [26]:

```
set(['a', 'b', 'c', 'd']).intersection(set(['b', 'c']))
```

Out[26]:

So, which links from this page are included in our list? Here we see this using sets:

In [27]:

```
list(set(new_links).intersection(set(internal_links)))
```

Out[27]:

You should see that there are just three links to other pages in our set.

The other way to solve this problem is to test whether a specific link
is in our list `internal_links`. This can be done with the operator `in`
for a particular element:

In [28]:

```
'Cake' in internal_links
```

Out[28]:

In [29]:

```
'Age of marriage' in internal_links
```

Out[29]:

How do we make use of this? We can use an `if` statement, which executes its
code only if a statement is `True`. So:

In [30]:

```
for link in new_links:
    if link in internal_links:
        print(link)
```

Either way works for this application. The sets are a bit cleaner, but we'll need
`if` statements in other applications, so it's useful to see how they work here.
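
One practical note when using `in` on larger datasets: membership tests against a list scan every element, while tests against a set are effectively constant time. Converting the list to a set once before the loop can make a noticeable difference. A sketch with made-up link lists standing in for `new_links` and `internal_links`:

```python
# Made-up lists standing in for new_links and internal_links
new_links = ['Cake', 'Age of marriage', 'Candle', 'Party']
internal_links = ['Cake', 'Candle', 'Birthday']

# Convert once; each 'in' test is then O(1) instead of O(len(internal_links))
internal_set = set(internal_links)
kept = [link for link in new_links if link in internal_set]
print(kept)  # ['Cake', 'Candle']
```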

Now that we know how to include only the correct edges, we will write code to create the correct graph. It will be more interesting to work with the Richmond dataset in our analysis, so let's use that one now.

In [31]:

```
data_json = wiki.get_wiki_json("Richmond, Virginia")
internal_links = wiki.links_as_list(data_json) + ['Richmond, Virginia']
```

Now, write the code for constructing the graph here:

In [32]:

```
G = nx.Graph()
for link in internal_links:
    data = wiki.get_wiki_json(link)
    for new_link in data['links']:
        if new_link['*'] in internal_links:
            G.add_edge(link, new_link['*'])
```

When you're done, test that it has as many nodes as there are internal links.

In [33]:

```
print(len(G.nodes))
print(len(internal_links))
```

**Note:** There is one more internal link than there are nodes because we have a
duplicate in the dataset. I manually added a link to 'Richmond, Virginia', but
the page already cites itself. This will not affect our analysis.
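
If the duplicate bothers you, one way to remove it while preserving order is `dict.fromkeys`, since dictionary keys are unique and keep insertion order in Python 3.7+. A sketch with a made-up list:

```python
# Made-up list with a duplicate entry
links = ['James River', 'Richmond, Virginia', 'Virginia', 'Richmond, Virginia']

# dict keys are unique and ordered, so this deduplicates while keeping order
deduped = list(dict.fromkeys(links))
print(deduped)  # ['James River', 'Richmond, Virginia', 'Virginia']
```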

Now, finally, let's actually analyze our data using the `networkx` module.
To start, try to plot the network (it's possible that this will be too slow
on your machine; just try it out anyway).

In [34]:

```
nx.draw_spring(G, with_labels=True)
```