Class 22: Working with Network Data
Our last topic for this semester will be studying network data. I originally said it would be spatial data, but I think you’ll find this more interesting.
Network data consists of two types of objects:
- nodes, the primary object of study
- edges, connections between nodes
These two things can in theory be any set of objects with links between them. Some common examples include:
- social networks: nodes are people and edges describe relationships
- family trees: nodes are people and edges describe marriage or child-parent relationships
- citation networks: nodes are papers/articles/books and edges are citations from one work to another
- similarity network: nodes are almost any entity and edges describe objects that are particularly similar
Today we will see some simple ways of working with network data. Over this and the next two classes we will see how to construct, manipulate, and visulize various types of networks.
Note that you may have heard of what I call networks under the term of a graph. This is the common name in the mathematical study. In math nodes are often called verticies. You’ll also hear edges called both links and connections. These all represent the same structures, though the kinds of questions mathematicians are concerned with are generally distinct from those of interest in statistics.
Our first network dataset is an artificial network created to illustrate some of the primary properties. The graph is called a “kite” because of its shape. We will move to more interesting data soon.
In order to describe a network dataset, we need two tables. The first describes the
edges, with the first column giving an
id for the first node and the second giving
id_out for the second node. This implies that there is an edge between
Edge datasets may also contain additional metadata describing the specific relationship. There is also a table describing the nodes. This starts with a column giving the id of the node, with other columns indicating metadata about the node. Here we have the age of the person and whether they are a white, green, or black belt (we’ll assume the relationships indicate friendships within a karate class).
These two tables fully describe the network, but there is not a lot that we can do directly with them. We need to add learned network data to both the edges and nodes. To do this, you’ll need to re-install the smodels package and install the igraph package:
Once installed, load the smodels package and pass the edges and nodes to the
The output is a list of two tables; we can extract them in order to look at what they contain. The nodes now have many additional columns describing features of the network. We’ll cover this in a moment:
And the edges now have coordinates describing how we could plot the data:
Two important new variables in the
nodes dataset are
y. These indicate
where the nodes should be plotted. These were determined by a graph layout algorithm
called Fruchterman–Reingold. We can see the layout of the nodes here (I’ll write the
code in terms of the object
gr as we will not usually extract the tables directly):
We see that the nodes are spread out over the space in a reasonably nice pattern. I’ve used the
theme_void because I have no need for the specific values of
y. Now let’s use
the edges dataset to add the links between the nodes. I’ll also add names directly
below each node and color the nodes based on what karate belt each person has earned.
Notice that the layout made linked nodes close together and minimizes crossing of edges. The algorithm actually uses a physics simulation where edges are treated as springs and nodes as positively charged particles. Hence, the nodes want to spread out but the edge push together neighboring nodes.
You’ll probably find it useful to create interactive graphics. For instance, this creates an interactive plot with the names created when scrolling over the data.
We’ll now take a look at some of the other metadata included with the
Many of the variables concern the centrality of a particular node. There are a few different
ways of measuring this.
Eigenvalue centrality constructs a score for each node such that each nodes centrality score is proportional to the sum of its neighbors scores. In math:
So the more neighbors a node has the higher it’s centrality, though neighbors that are central themselves contribute more than nodes that are not otherwise very central. We won’t go into the many today, but this can be solved by computing the eigenvalue problem for the adjacency matrix of the graph. By convention, the maximum eigen centrality score is equal to 1 (otherwise, we could multiple all scores by a constant but keep the same relationship):
In this measurment, we see that Dieter is the most central node. He is right in the middle of the cluster of most densely connection points.
Another centrality measurement is called betweenness. Consider all shortest paths between all pairs of nodes in the network. Betweenness measures, for each node, the proportion of these paths that run through a given node. We can plot this as well:
The score in R is scaled by a factor depending on the number of nodes in the graph (generally, we only care about the relative values anyway). Notice now that Hellen is much more central because she connects everyone in the graph to Ida.
Another measurement of centrality is called closeness. It is defined as the average distance (number of hops) from each node to a given node of interest.
Here Gaby, Flavia, and Dieter are equally central. Flavia and Gaby have fewer direct links, but are closer to Ida than Dieter.
Finally, we can also use degree centrality. This simply measures how many neighbors a graph has.
Notice that these measures agree to some extent but do measure slightly different aspects of centrality. Nodes that have a high betweenness but relatively low eigen centrality are known as gatekeepers (such as Hellen above). These are the influencial people who link clusters within a graph.
Speaking of clusters, the varible
cluster gives a clustering of the graph nodes into a
given number of groups.
Here there is a large primary group (the blue one) and the small subgroup of Hellen and Ida (the red one).
Let’s explore these ideas on a larger social network dataset. This consists of employees at a small fictional company:
There are several variables describing the nature of the relationship between co-workers. We know whether the employees are friends, report to each other, and whether they seek each other out for advice.
We can build a graph showing friendship relationships by filtering on friendships:
We see that there is not a very strong age-dependency to the friendship graph:
We can also look at the reporting relationships. Here, however, the edges have a direction
to them (one person reports to the other person). We can specify this in the option
in the function
We can graph the data with arrows by specifying the
arrow option in
Notice that there is are variables
degree_out the look at the difference
between being close to other nodes as inputs or outputs. For example:
The out degree is one for everyone other than the head of the company:
Generally, I don’t worry much about direction in networks and treat all networks as undirected. Occasionally it can be useful, however, to consider the egges as only occuring in one direction.
While there are certain types of networks commonly seen and used (such as citations graphs and forms of social networks), this need not be the only kind of graph that we can construct. Here is a network of common English nouns and adjectives with words connected that tend to co-occur.
In the second and third network lectures we will see how to create various networks ourselves from data.