# Class 23: Creating Citation Graphs

## Citation Networks

Today we are going to build several types of graphs describing the relationships between Wikipedia pages. This builds nicely on the text analysis work that you did in Project II (the code should actually be more straightforward, as you’ll generally be able to use mine with only minor changes).

Let’s start by building a citation graph. That is, our nodes will be Wikipedia pages, and two pages will share an edge if one of them links to the other. To start, I’ll select the Richmond, Virginia page. We’ll print out the links from the first three paragraphs:

Notice that we do not want links that contain a colon (“:”) as these are special Wikipedia pages. Filtering these out and taking the unique links gives a fairly large set of pages:
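As a rough illustration of the filtering step, here is a minimal Python sketch. It assumes we already have the page HTML as a string; the class name `WikiLinkParser` and the function `page_links` are hypothetical, and dropping `#section` anchors is my own assumption rather than something the steps above require.

```python
from html.parser import HTMLParser


class WikiLinkParser(HTMLParser):
    """Collect the targets of internal /wiki/ links found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/wiki/"):
            self.links.append(href[len("/wiki/"):])


def page_links(html):
    """Return unique article links in order of first appearance, skipping special pages."""
    parser = WikiLinkParser()
    parser.feed(html)
    seen = set()
    unique = []
    for link in parser.links:
        link = link.split("#")[0]  # assumption: ignore section anchors
        if ":" in link or not link or link in seen:
            continue  # a colon marks a special page such as File: or Category:
        seen.add(link)
        unique.append(link)
    return unique


sample = (
    '<p><a href="/wiki/Virginia">Virginia</a> '
    '<a href="/wiki/File:Richmond.jpg">photo</a> '
    '<a href="/wiki/Virginia">again</a> '
    '<a href="/wiki/James_River">James River</a></p>'
)
print(page_links(sample))  # ['Virginia', 'James_River']
```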

As with the web-scraping code, we’ll cycle over all of these links and download each page. Here, I’ll just extract all links from those pages and build a matrix of all links on every page mentioned on the Richmond, Virginia page.

This can take a little time, so I included a counter that prints out the progress of the algorithm.
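The loop with a progress counter might look something like the sketch below. To keep it runnable without a network connection, the download step is passed in as a function; `collect_links`, `fetch_links`, and the tiny `toy_web` dictionary are all hypothetical stand-ins.

```python
def collect_links(pages, fetch_links, report_every=100):
    """Map each page to the list of links found on it, printing progress as we go."""
    links_by_page = {}
    for i, page in enumerate(pages, start=1):
        links_by_page[page] = fetch_links(page)
        if i % report_every == 0 or i == len(pages):
            print(f"processed {i} of {len(pages)} pages")
    return links_by_page


# Stand-in for a real download step: a tiny hand-made "web".
toy_web = {
    "Virginia": ["James_River", "United_States"],
    "James_River": ["Virginia", "Chesapeake_Bay"],
    "United_States": ["Virginia"],
}
links_by_page = collect_links(list(toy_web), lambda page: toy_web[page], report_every=2)
```

In a real run, `fetch_links` would download the page and extract its links, ideally with a short pause between requests.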

Notice that many of the links on these 538 pages point to pages that are not in our set. On average, only 6% of the links are in our collection:
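One way to compute that share, sketched in Python under the assumption that the links are stored as a dictionary from page name to its list of outgoing links (the function name `share_in_set` and the toy data are hypothetical):

```python
def share_in_set(links_by_page):
    """For each page, the fraction of its outgoing links that point inside the collection."""
    pages = set(links_by_page)
    shares = {}
    for page, links in links_by_page.items():
        if links:  # skip pages with no links to avoid dividing by zero
            shares[page] = sum(link in pages for link in links) / len(links)
    return shares


toy = {
    "A": ["B", "C", "X", "Y"],  # 2 of 4 links stay inside the set
    "B": ["A", "Z"],            # 1 of 2
    "C": ["X"],                 # 0 of 1
}
shares = share_in_set(toy)
average = sum(shares.values()) / len(shares)
print(shares, average)
```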

For now, we only want those edges that point to another page in our initial set. We will construct a manual edge list from this:
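A minimal sketch of building such an edge list follows. Treating edges as undirected, dropping duplicates, and ignoring self-links are my assumptions; the name `edge_list` is hypothetical.

```python
def edge_list(links_by_page):
    """Keep only links whose target is also in the collection, as undirected edges."""
    pages = set(links_by_page)
    edges = set()
    for source, links in links_by_page.items():
        for target in links:
            if target in pages and target != source:
                # Sort the pair so A-B and B-A are stored as the same edge.
                edges.add(tuple(sorted((source, target))))
    return sorted(edges)


toy = {
    "A": ["B", "C", "X"],  # X is outside the set and is dropped
    "B": ["A"],            # duplicates the A-B edge
    "C": ["B", "C"],       # the self-link C-C is ignored
}
edges = edge_list(toy)
print(edges)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```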

Now, I will use the graph_data function to extract network data from the edge list. Notice that not passing a node list creates one automatically from the edges (however, there will not be any additional metadata in this case to work with):
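To make the idea of deriving the node list from the edges concrete, here is a small Python sketch. It is not the graph_data function itself, only an illustration of the same idea, and the name `build_graph` is hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Derive the node list, adjacency sets, and degrees from an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes = sorted(adjacency)  # every node that appears in at least one edge
    degree = {node: len(adjacency[node]) for node in nodes}
    return nodes, dict(adjacency), degree


nodes, adjacency, degree = build_graph([("A", "B"), ("B", "C")])
print(nodes, degree)
```

Note that a page with no edges never appears in the node list, which is one reason to pass an explicit node list when you have one.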

Plotting the network we can get an idea of what the citation graph looks like:

We could make this interactive to try to read the node names, which is helpful. Alternatively we can work with the nodes data as a table, ordering from the largest to the smallest eigenvalue centrality score:
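For intuition about what the eigenvalue centrality score (often called eigenvector centrality) measures, here is a hedged Python sketch using power iteration. The update adds each node's own score to the sum of its neighbours' scores, a common trick to keep the iteration from oscillating; the function name and the star-shaped toy graph are hypothetical, and the class code may compute the score differently.

```python
def eigen_centrality(adjacency, iterations=200):
    """Approximate eigenvector centrality by repeated neighbour averaging (power iteration)."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        updated = {
            node: scores[node] + sum(scores[nb] for nb in neighbours)
            for node, neighbours in adjacency.items()
        }
        top = max(updated.values())
        scores = {node: value / top for node, value in updated.items()}  # rescale so max is 1
    return scores


# A star: one hub linked to three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = eigen_centrality(star)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

A node scores highly when its neighbours score highly, which is why the hub comes first here.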

What do you notice about the most central pages? Remember, Richmond is not included in the set, so it is not surprising that it’s missing.

Plotting eigenvalue and betweenness scores, we see that VCU and “Geographic Coordinate System” are the gatekeepers in this network:
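To see why a high betweenness score marks a gatekeeper, here is a sketch of betweenness for an unweighted, undirected graph using Brandes' algorithm. It is an illustration on a made-up graph, not the class code; the scores are unnormalised, which is my choice.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        parents = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share of paths.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in parents[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected path was counted once from each end.
    return {node: value / 2 for node, value in scores.items()}


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
bowtie = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
between = betweenness(bowtie)
print(between)
```

Every shortest path between the two triangles has to pass through `c` and `d`, so they are the only nodes with a non-zero score.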

What are the clusters? Let’s select the clusters with more than five pages and find the most central nodes within each:

Could you describe any of these with a short title?

## Co-citation

An alternative method for constructing a citation network is to form the co-citation network. A co-citation connects two nodes if other pages cite both of them together. We can get co-citations using the co_cite function:

You can toy around with how many counts you want to include before connecting two pages. Here I’ll just use 3 or more counts:
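The counting and thresholding can be sketched in Python as follows. This is not the co_cite function, only an illustration of the idea; `co_citation_edges` is a hypothetical name, and restricting the counts to pages inside our collection is my assumption.

```python
from collections import Counter
from itertools import combinations


def co_citation_edges(links_by_page, min_count=1):
    """Count how many pages cite each pair of pages together; keep pairs with enough counts."""
    pages = set(links_by_page)
    counts = Counter()
    for links in links_by_page.values():
        cited = sorted(set(links) & pages)      # pages in our set cited by this page
        counts.update(combinations(cited, 2))   # every pair cited together gets one count
    return {pair: n for pair, n in counts.items() if n >= min_count}


toy = {
    "A": ["B", "C"],
    "D": ["B", "C", "X"],  # X is outside the set
    "E": ["B", "C", "D"],
    "B": ["C"],
    "C": [],
}
print(co_citation_edges(toy))               # B-C three times; B-D and C-D once each
print(co_citation_edges(toy, min_count=3))  # only B-C survives
```

Raising `min_count` thins out the graph, so it is worth trying a few values.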

The graph looks quite a bit different now:

And the most central nodes have changed as well:

Let’s join the two datasets together to find the largest differences:

The eigenvalue scores are certainly correlated, but they are not perfect copies of one another.

Here are the most different nodes between the two measurements:
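A small Python sketch of the join-and-compare step, assuming each set of scores is stored as a dictionary from page name to score. The function name and the numbers are made up purely for illustration.

```python
def largest_differences(scores_a, scores_b, top=5):
    """Inner-join two score tables on page name and rank pages by absolute difference."""
    shared = scores_a.keys() & scores_b.keys()
    rows = [
        (page, scores_a[page], scores_b[page], scores_a[page] - scores_b[page])
        for page in shared
    ]
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows[:top]


# Made-up scores for placeholder pages P1..P4.
citation = {"P1": 0.9, "P2": 0.4, "P3": 0.1}
co_citation = {"P1": 0.8, "P2": 1.0, "P4": 0.5}
rows = largest_differences(citation, co_citation)
for row in rows:
    print(row)
```

Pages that appear in only one of the two graphs (here `P3` and `P4`) drop out of the comparison.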

Do you notice any patterns in these? Where do you think they tend to appear most?

## Text similarity

Citations are not the only way of creating a graph over the Wikipedia entries. An alternative method is to use the text itself to connect pages that have similar content. Let’s start with the Statistics page, grabbing the links again:

Now, we will cycle over these pages but instead of grabbing links I’ll just save the text (without the HTML codes):

Now, I’ll use the tokenization code to create a data frame showing the “distance” between any two documents. The distance is a function of how similar the words are between the documents.
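The notes do not say which distance is used, so the sketch below simply assumes the Euclidean distance between raw word-count vectors. The function names and the three tiny documents are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_counts(text):
    """Lower-case the text and count each run of letters as a word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def euclidean(a, b):
    """Euclidean distance between two word-count vectors (missing words count as 0)."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys()))


def distance_table(texts):
    """One row (doc_x, doc_y, distance) for every pair of documents."""
    counts = {name: word_counts(text) for name, text in texts.items()}
    return [
        (x, y, euclidean(counts[x], counts[y]))
        for x, y in combinations(sorted(counts), 2)
    ]


texts = {
    "d1": "mean median mode",
    "d2": "mean median variance",
    "d3": "graph node edge",
}
table = distance_table(texts)
for row in table:
    print(row)
```

Documents that share most of their words (`d1` and `d2`) end up closer together than documents with no words in common.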

Then, I’ll create an edge list by including only those pairs of documents that are within a distance of 45 of each other (I selected this cut-off by trial and error):

And you can see that the resulting graph is dominated by one tight cluster of documents:

Alternatively, we can connect each page to its k-nearest neighbors. Here, I’ll link to the closest 3 pages:
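To compare the two ways of turning distances into edges, here is a Python sketch with a made-up distance table. The function names are hypothetical, and treating the cut-off as "less than or equal to" is my assumption.

```python
def threshold_edges(distances, cutoff):
    """Connect every pair of documents whose distance is at most the cut-off."""
    return sorted(pair for pair, d in distances.items() if d <= cutoff)


def knn_edges(distances, k):
    """Connect each document to its k closest documents (edges are undirected)."""
    neighbours = {}
    for (a, b), d in distances.items():
        neighbours.setdefault(a, []).append((d, b))
        neighbours.setdefault(b, []).append((d, a))
    edges = set()
    for node, candidates in neighbours.items():
        for _, other in sorted(candidates)[:k]:
            edges.add(tuple(sorted((node, other))))
    return sorted(edges)


# Made-up distances: A, B, and C are close together, D is far from everything.
distances = {
    ("A", "B"): 1.0, ("A", "C"): 2.0, ("A", "D"): 9.0,
    ("B", "C"): 1.5, ("B", "D"): 8.0, ("C", "D"): 7.0,
}
print(threshold_edges(distances, 2.0))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(knn_edges(distances, 1))          # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

With a cut-off, the outlying document `D` is left isolated and the close documents form a dense clump; with nearest neighbours, every document gets at least `k` edges, which tends to spread the graph out.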

This graph looks quite a bit more interesting and spread out:

Can you make any sense of the topics here?

Changing the cut-off value significantly affects the output of the model. You’ll have a chance to experiment with this in Project 3.