Grammar of Graphics
As you have seen in examples already, we will be using the ggplot2 package for graphics in this course. The gg stands for the Grammar of Graphics, an influential theoretical structure for constructing statistical graphics created by Leland Wilkinson.
To build a statistical graphic, we will be building different layers that fit together to produce plots. Each layer requires three elements:
- a geometry describing what type of layer is being added; for example, this might be a point, line, or text geometry
- a dataset from which to build the layer
- a mapping from variables in the dataset into elements called aesthetics that control the way the plot looks
Example with Hans Rosling’s data
To illustrate these points, let’s look at a subset of the data that Hans Rosling used in the video I showed on the first day of class. It contains just a single year of the data (2007).
Here is a plot similar to the one that Rosling used (I will show the code to construct it in just a few moments). Note that R writes the population key in scientific notation (2.5e+08 is the same as 2.5 times 10 to the power of eight).
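If the scientific notation is unfamiliar, a quick check in the console may help. This is only a small sketch; the names pop_key and written_out are made up for illustration.

```r
# 2.5e+08 is shorthand for 2.5 times 10 to the power of eight
pop_key <- 2.5e+08
pop_key == 2.5 * 10^8   # TRUE

# Write the same number out in full, with commas between the thousands
written_out <- format(pop_key, scientific = FALSE, big.mark = ",")
written_out             # "250,000,000"
```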
Here, two of our three elements should be clear: the dataset is gapminder_2007 and the plot uses the point geometry (geom_point, which we have already seen). How do the aesthetic elements function? There are four visible aesthetics here, each matched to a particular variable in the dataset:
- the variable gdp_per_cap is mapped to the x-axis
- the variable life_exp is mapped to the y-axis
- the variable continent is mapped to the color
- the variable pop is mapped to the size
Notice how each of these is shown in the resulting plot.
How do we write the code that actually produces this plot? Here is the code written out in full.
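A minimal sketch of that code, with every argument named, might look as follows. It assumes gdp_per_cap is on the x-axis and life_exp on the y-axis, as in Rosling's chart. The few stand-in rows contain made-up values and exist only so the snippet runs on its own; in class, use the real gapminder_2007.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Base plot: dataset plus the x- and y-variables
p <- ggplot(data = gapminder_2007, mapping = aes(x = gdp_per_cap, y = life_exp)) +
  # Point layer: color follows the continent, size follows the population
  geom_point(mapping = aes(color = continent, size = pop))
p
```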
Equivalently, we can leave off the names in the first row. R knows by default that the first argument to ggplot is the dataset, and that the first two arguments to aes are the x-variable and the y-variable.
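The shorter form can be sketched like this, again assuming gdp_per_cap is the x-variable and life_exp the y-variable. The stand-in rows are made up and only there so the snippet runs without the course data.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Names dropped in the first row only; color and size are still named
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point(aes(color = continent, size = pop))
p
```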
Let’s dive a bit deeper into what this plot is doing. The first line sets up a base plot by defining the dataset and indicating which variables are associated with the x- and y-axes. To this line we add a geometry that lets R know that we want to include points on this plot. Within the points, we further want to assign color to change with the continent and size to change with the population. Note that these latter elements must be named; otherwise R will not know exactly which variables are being mapped to which aesthetics.
Recall that previously we did not define the color or size of the points. Leaving these out simply means that R uses the default size (1.5) and color (black):
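As a sketch, the plot with default points needs nothing inside geom_point at all. The stand-in rows below are made up so the snippet runs on its own.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# No color or size given, so the points keep their default look
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point()
p
```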
In some cases we want to change an aesthetic to a different fixed value than the default. To do this, we include the specification of the aesthetic outside of the aes function. Here are points colored in blue:
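One way to write this, as a sketch: the color is given directly to geom_point rather than inside aes. The stand-in rows are made up so the snippet runs on its own.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Fixed value: color sits outside aes(), so every point is blue
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point(color = "blue")
p
```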
It is possible to mix aesthetics so that some are mapped to variables and others to fixed values. Simply specify the fixed values outside of the aes function, after the variable aesthetics. Here are small points with color denoting the continent:
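A sketch of the mixed version follows. The size of 0.5 is my own choice for "small" and is only an assumption; the stand-in rows are made up so the snippet runs on its own.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Variable aesthetic inside aes(), fixed aesthetic after it
# (0.5 is an assumed value for "small")
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point(aes(color = continent), size = 0.5)
p
```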
You’ll notice that I put the color blue in quotes but left the size specification as-is. This comes back to the notion of a data type in R. A fixed color is specified by a character string, which has to be contained in quotes, but a size is given by a number, which is written without quotes. Note: this applies only to a fixed value, not when assigning something by a variable.
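To see the two data types for yourself, you can ask R for the class of each value. This is just a small sketch; the names fixed_color and fixed_size are made up for illustration.

```r
# A fixed color is a character string, so it needs quotes
fixed_color <- "blue"
class(fixed_color)   # "character"

# A fixed size is a number, so it is written without quotes
fixed_size <- 0.5
class(fixed_size)    # "numeric"
```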
The beauty of the grammar of graphics is that we can construct many plots by combining simple layers. The geom_text layer is another layer type that puts a label in place of a point. It requires a new (non-optional) aesthetic called label that describes which variable is used for the label. Here we see it combined with the points layer:
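A sketch of the two layers together is below. It assumes the dataset has a column called country to use for the labels; that column name is an assumption, as are the made-up stand-in rows.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Two layers share the x and y mapping from the base plot;
# the text layer adds the required label aesthetic (country is assumed)
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point() +
  geom_text(aes(label = country))
p
```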
Although it makes little sense here, we could also add a line plot to the graphic:
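As a sketch, the line is just one more layer added with a plus sign. The country column and the stand-in rows are assumptions, as before. The line layer joins the observations in order of the x-variable, which is why it means little for this data.

```r
library(ggplot2)

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# Three layers: points, labels (country is assumed), and a connecting line
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point() +
  geom_text(aes(label = country)) +
  geom_line()
p
```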
As we go through this material today, take particular note of the format for the next quiz.
Prototype and References
Some students, depending on their learning style, find it easiest to learn from a prototype showing exactly how ggplot2 commands are structured. In the code below, anything in square brackets and capitalised should be changed; other elements should generally be kept as-is:
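One possible prototype, pieced together from the examples in these notes, is sketched here. The placeholder names in square brackets are my own labels, and the prototype lines are commented out because they will not run until the placeholders are replaced. A filled-in version follows, using made-up stand-in rows and an assumed fixed size of 2.

```r
library(ggplot2)

# Prototype (not runnable as written): replace each [BRACKETED] item.
# ggplot([DATASET], aes([X_VARIABLE], [Y_VARIABLE])) +
#   geom_[LAYER_TYPE](aes([AESTHETIC] = [VARIABLE]), [AESTHETIC] = [FIXED_VALUE])

# Stand-in rows with made-up values so the sketch runs on its own;
# skipped when the course's gapminder_2007 is already loaded.
if (!exists("gapminder_2007")) {
  gapminder_2007 <- data.frame(
    country     = c("Country A", "Country B", "Country C", "Country D"),
    continent   = c("Africa", "Americas", "Asia", "Europe"),
    gdp_per_cap = c(1500, 9000, 4000, 30000),
    life_exp    = c(55, 72, 68, 80),
    pop         = c(2.0e7, 5.0e7, 2.5e8, 1.0e7)
  )
}

# The prototype with every placeholder filled in
p <- ggplot(gapminder_2007, aes(gdp_per_cap, life_exp)) +
  geom_point(aes(color = continent), size = 2)
p
```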
If you would like more references, here are a cheat-sheet and online notes that extend what we have done today:
These cover much more than we have shown today, and you are only responsible for the notes here. However, you may find the exercises and examples useful if this material is new to you.
We have covered a lot of new commands today. Practicing them is incredibly important to keeping up with this course. You will not learn how to do these properly without spending a reasonable amount of time practicing these skills outside of class. Download the lab08.Rmd file and work through the exercises. Upload your script (no need to include the HTML file) to GitHub ahead of the next class.