Today, we’ll look at three additional examples of networks.
Pay attention, as you’ll be selecting from among these for
the third and final project. Notice that all of these, like
the Supreme Court citations, are too large to look at all at
once, so you’ll need to subset the edges or nodes.
For those of you interested in sports data, I have two datasets constructed from
Major League Baseball. The nodes in the first are the various MLB franchises:
And the edges indicate, in a given year, how many players on one team came from
To do something interesting with this, you’ll need to take a subset of the years
and (likely) truncate to only those edges with a large enough count. Here, I’ll
look at 2010 and counts above 10:
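
A minimal sketch of this kind of subsetting in Python, assuming a hypothetical edge table with `year`, `source`, `target`, and `count` fields. The field names and the rows are invented for illustration and are not taken from the actual dataset.

```python
# Hypothetical edge records; field names and values are illustrative only.
edges = [
    {"year": 2010, "source": "BOS", "target": "NYY", "count": 12},
    {"year": 2010, "source": "BOS", "target": "TBR", "count": 3},
    {"year": 2009, "source": "CHC", "target": "STL", "count": 15},
    {"year": 2010, "source": "LAD", "target": "SFG", "count": 11},
]


def subset_edges(edges, year, min_count):
    """Keep edges from one year whose count is strictly above min_count."""
    return [e for e in edges if e["year"] == year and e["count"] > min_count]


subset = subset_edges(edges, year=2010, min_count=10)

# Nodes that still take part in at least one retained edge.
nodes = sorted({e["source"] for e in subset} | {e["target"] for e in subset})
```

Dropping nodes that no longer have any edges after the filter usually makes the plotted graph much easier to read.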
Interesting questions include local effects (how does this graph change over
a specific decade?) and long-run effects (how does it change over a long time period?).
For example, here is the graph from before the modern free-agency era:
The second baseball dataset is similar, but includes links between MLB teams and
college teams for a given year:
Take a look at the data from 1950:
I can see some regional effects here in the 1950 graph (the Red Sox have the only players from
Providence and UConn, for example). Richmond even has a player on the Yankees roster!
The second set of graph data concerns RFID tags from a French hospital system over the
course of 8 days. The nodes consist of patients, nurses, administrators, and physicians:
The edges indicate whenever two entities came in contact with one another in a given 20-second interval:
Interesting relationships can be understood by looking at the graph over various time windows.
A particularly interesting approach could collect summary statistics over particular hours
and then plot that data. There is a lot of potential here, though it will take some
digging into the dataset to find it.
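
As a rough sketch of the hourly-summary idea, assuming hypothetical contact records of the form (seconds since the start of recording, first id, second id). The record layout and the ids are assumptions for illustration, not the dataset's real schema.

```python
from collections import defaultdict

# Hypothetical contact records: (seconds since start, id_a, id_b).
contacts = [
    (20, "nurse_1", "patient_1"),
    (40, "nurse_1", "patient_1"),
    (3700, "nurse_1", "physician_1"),
    (3720, "patient_2", "physician_1"),
    (7300, "nurse_2", "patient_2"),
]


def hourly_summary(contacts):
    """Return {hour: (number of contacts, number of distinct pairs)}."""
    counts = defaultdict(int)
    pairs = defaultdict(set)
    for t, a, b in contacts:
        hour = t // 3600  # integer hour index since the start
        counts[hour] += 1
        pairs[hour].add(frozenset((a, b)))  # unordered pair
    return {h: (counts[h], len(pairs[h])) for h in sorted(counts)}


summary = hourly_summary(contacts)
```

The resulting per-hour values can then be plotted as a time series, or split by day or by type of individual.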
The final dataset comes from character relationships in Shakespeare’s plays. There are
two different sets of edges (there is no separate nodes table), depending on whether links
should indicate that two characters talk to one another or that they appear within a fixed number
of words of one another:
There is a separate element for each play. As you can see, connections have scores that you
could use to filter to only the strongest relationships. Here is the network from “A Midsummer Night’s Dream”:
The clusters line up well to the different aspects of the play.
And here is the same set of characters using the speech network:
You can study and compare multiple plays; here is “Romeo and Juliet”:
An interesting project would be to study the differences and similarities between the plays.
Do they themselves relate in any way?
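
One simple way to start comparing plays is to compute the same few summary numbers for each network after applying a score cut-off. The sketch below assumes each play is a list of undirected (character, character, score) triples with no repeated pairs. The play names, characters, and scores are placeholders, not values from the dataset.

```python
# Hypothetical scored, undirected edges for two plays.
plays = {
    "play_a": [("x", "y", 5), ("y", "z", 1), ("x", "z", 4)],
    "play_b": [("p", "q", 4), ("q", "r", 6), ("r", "s", 7), ("p", "s", 1)],
}


def summarize(edges, min_score):
    """Node count, edge count, and density after a score cut-off."""
    kept = [(a, b) for a, b, score in edges if score >= min_score]
    nodes = {name for pair in kept for name in pair}
    n = len(nodes)
    possible = n * (n - 1) / 2  # possible undirected pairs among kept nodes
    density = len(kept) / possible if possible else 0.0
    return {"nodes": n, "edges": len(kept), "density": density}


summaries = {name: summarize(edges, min_score=3) for name, edges in plays.items()}
```

Note that density here is measured only over characters that survive the cut-off, so changing `min_score` changes both the numerator and the denominator.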
The third project requires you to do a data analysis based on network data. Your analysis
should not focus on just a single network, but should consist of comparing multiple
networks to find more interesting meta-patterns. I am open to other suggestions, but
generally I recommend that you use one of the following datasets:
Wikipedia links data (double hops; text vs. citation vs. co-citation; contrast different starting points)
Baseball datasets (compare college vs. pro; look across years; apply different cut-offs)
Shakespeare plays (compare speech vs. time; play with the cut-off; compare plays, perhaps clustered on type: comedy, tragedy, history)
RFID data (look at the graph over time; compare across days, hours, types, and individuals)
Supreme Court citations (look at the graph for various issues, perhaps over time, using different cut-offs and looking at citation and co-citation graphs)
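
For the citation vs. co-citation contrast that appears in the list above, here is a minimal sketch of deriving co-citation counts from a directed citation edge list. Two documents are co-cited whenever a third document cites both. The (citing, cited) layout and the ids are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical directed citation edges: (citing, cited).
citations = [
    ("case_c", "case_a"),
    ("case_c", "case_b"),
    ("case_d", "case_a"),
    ("case_d", "case_b"),
    ("case_e", "case_a"),
]


def cocitation_counts(citations):
    """For each unordered pair, count how many documents cite both."""
    cited_by = defaultdict(set)
    for citing, cited in citations:
        cited_by[citing].add(cited)
    counts = defaultdict(int)
    for cited in cited_by.values():
        # Every pair cited by the same document is co-cited once.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return dict(counts)


cocite = cocitation_counts(citations)
```

The counts can serve as edge weights in an undirected co-citation graph, which can then be thresholded in the same way as the other networks.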
The end goal is to find something interesting and convey it through graphics and/or
models in your data analysis report. This should more closely resemble
the first data analysis rather than the second one (i.e., there should be a thesis rather than
a hypothesis). We will have presentations on these reports during the final week of the term.