Today we will once again look at the NYC flights dataset:
This time we will also look at several other tables that describe
various aspects of the flights. Most of the names of these
should be self-explanatory:
Relational data keys
All of these tables are connected by way of common shared
variables known in database theory as keys. A key may be
a single variable or a collection of variables (known as
a composite key). There are two types of keys:
a primary key uniquely identifies a row in a data table
a foreign key uniquely identifies a row in another data table
An individual variable may be part of both a primary and foreign
key even within the same table. Notice that variables in a key
may have a different name in another table. In fact, they may
even map to multiple variables in another table. An example is the
faa code in the airports dataset, which maps to both the
origin and destination fields in the flights dataset.
The diagram below shows all of the connections between the five
tables that we have loaded:
A primary key and the corresponding foreign key in another table form a relation.
Typically a relation maps a single row in one dataset to many rows in another.
For example, each flight has one origin, but each origin has many flights.
Two table verbs
Recall that we called the functions from dplyr that take a dataset and
return another dataset verbs. Specifically, all of the verbs we have learned
so far are one table verbs as they take just a single table as an input.
When working with relational data we need verbs that take two tables and
return some sort of new output. These are called two table verbs. In
case you are wondering, there are no three table verbs. To work with more
than two tables, simply chain together multiple two table verbs.
Mutating joins take two tables and combine them into a single table
that contains all variables in either original table. We can use them
for operations such as adding weather information to the flights dataset.
Before moving forward, I’ll create a smaller version of the flights
dataset to make sure that this prints out well in the HTML notes:
In my experience, the two table verb left_join is by far the most
commonly used way of combining two tables. The function takes two
tables as its first two arguments, followed by the input by that
describes which variables are going to be used to join the tables.
Here we see that it returns a new table with the same number of rows
as the flights_sml dataset but now with the full name of the carrier:
As mentioned before, the key of interest may have different names in each
data set. The by option can be modified to specific the name of the key
in the first table and the name of the key in the second table. Here we
use it to combine the flights dataset with the airports data:
Notice that the destination for the fourth row did not match
with any rows of the airports data. The airport “BQN” was not found
(if you are curious, BQN is in Aguadilla, Puerto Rico; presumably
the airports data onlyincludes those in the 50 US states).
Inner and outer joins
There are four mutating joins available in dplyr. These have the
same syntax. If the join key is a primary key in one table and there
are no missing matches, they all function exactly the same. If the
join key is a primary key in one table and not all foreign keys in
the other table match an entry in the other, the following happens:
left_join only non-matching rows in the first dataset are kept
right_join only non-matching rows in the second dataset are kept
full_join non-matching rows are included from either table
inner_join only matching rows from both tables are included
All of this can become quite complicated. I follow one simple rule
that make this much easier:
make sure that the key used for the join is a primary key for
the second dataset; that is, it unique defines each row of the
second dataset (usually this dataset contains metadata about the
If you follow this rule, you only ever need left_join (if you are
okay with missing values in the metadata) and inner_join
(if you want to remove rows that did not have associated metadata).
Recall that a key may consist of multiple variables. To
join on many variables at once, provide a vector of all the names
to the option by:
Sometimes there may be a variable name present in two datasets that
we want to merge together but that has a different meaning in each
dataset. For example, year is the year of the flight in the flights
dataset but the year of creating in the airplanes dataset. If
we join these two, we see that a suffix is added to each variable:
This behavior is fine, but it is better to manually describe
what the suffix should be:
If we do not specify the key to join with in the by option, dplyr
will assume that we want to join on all common keys. A warning will be
produced warning which variables were choosen. This can be useful in a
pinch, but generally is a bad idea to rely on.
There are two additional two table verbs that act like the filter
command, returning a subset of the rows, rather than like mutate,
which adds new columns. Specifically:
semi_join(x, y) keeps all rows in x that have a match in y
anti_join(x, y) keeps all rows in x that do not have a match in y
Unlike mutating joins, we do not need to be nearly as careful to make
sure that the joining key is a primary key for the second dataset.
We will see that filtering joins are particularly useful when working
with spatial and text data.