Today we will discuss how to construct new datasets as subsets of an existing
one.
NYC Flights Data
Often it will be useful to take a subset of a dataset. This can be useful
if we are only interested in a particular part of the dataset; it can also
be used if we want to create a visualization layer that highlights where
one subset of the data lies within another.
To illustrate how this works, today we will explore a dataset of every
commercial flight that departed from New York City in 2013 (we’ll use this
many times as it is a great teaching sample).
We will discuss a few different ways of taking a subset of the data, termed
filtering, before showing how this approach can be integrated into graphics.
The general syntax for filtering data in R is to use the following, where
expression is a logical statement about variables in the dataset OLD:
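As a minimal sketch of this pattern, assuming the filter function from the
dplyr package is the one meant here, and using a small made-up table (OLD, id
and value are hypothetical names chosen only for illustration):

```r
library(dplyr)

# OLD is a hypothetical toy dataset, not the flights data
OLD <- data.frame(id = 1:5, value = c(10, 25, 3, 40, 18))

# Keep only the rows of OLD for which the logical expression is TRUE
NEW <- filter(OLD, value > 15)
NEW
```

The original dataset OLD is left unchanged; the filtered rows are stored under
the new name NEW.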
For example, to find flights where the departure time is greater than
2300 (the times are in a 24 hour format):
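A sketch of this comparison, assuming dplyr's filter function and a tiny
made-up table standing in for the real flights data (the column names are
assumptions borrowed from the nycflights13 package):

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest     = c("RIC", "ATL", "RIC", "BOS", "ORD"),
  dep_time = c(2305, 1430, 2300, 2359, 600)
)

# Keep flights that departed strictly after 2300 (11pm)
late_flights <- filter(flights, dep_time > 2300)
late_flights
```

Because the comparison is strict, the toy flight that departs at exactly 2300
is not included.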
Notice that the new dataset has only 2573 rows, much smaller than the starting
dataset. Similar expressions exist for other numeric comparisons: < (less
than), >= (greater than or equal), and <= (less than or equal). Similarly
we can compare whether one variable is exactly equal to a particular value.
For this we need to use ==, not a single equal sign:
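A sketch of an equality test, again assuming dplyr's filter and a made-up
stand-in table with an assumed dep_time column:

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest     = c("RIC", "ATL", "RIC", "BOS", "ORD"),
  dep_time = c(2305, 1430, 2300, 2359, 600)
)

# Two equals signs test for equality; a single = would not be a comparison
at_eleven <- filter(flights, dep_time == 2300)
at_eleven
```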
Here we have only the flights that took off at exactly 11pm. The symbol !=
detects whether a value is not equal to a particular value:
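A sketch of the not-equal comparison, under the same assumptions (dplyr's
filter, a made-up stand-in table, an assumed dep_time column):

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest     = c("RIC", "ATL", "RIC", "BOS", "ORD"),
  dep_time = c(2305, 1430, 2300, 2359, 600)
)

# Keep every flight except those departing at exactly 2300
not_eleven <- filter(flights, dep_time != 2300)
not_eleven
```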
The == and != symbols also work for character and date variables; however,
you’ll need to make sure to enclose the comparison value (not the variable) in
quotes.
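A sketch of a character comparison, assuming dplyr's filter and a made-up
stand-in table whose dest column holds airport codes (an assumption modelled
on the nycflights13 package):

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest     = c("RIC", "ATL", "RIC", "BOS", "ORD"),
  dep_time = c(2305, 1430, 2300, 2359, 600)
)

# The value "RIC" is quoted; the variable name dest is not
to_richmond <- filter(flights, dest == "RIC")
to_richmond
```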
We can detect whether a variable is equal to a set of values using the %in%
and c functions:
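A sketch of matching against a set of values, assuming dplyr's filter and a
made-up stand-in table with an assumed dest column:

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest     = c("RIC", "ATL", "RIC", "BOS", "ORD"),
  dep_time = c(2305, 1430, 2300, 2359, 600)
)

# c() builds the set of values; %in% is TRUE when dest matches any of them
ric_or_bos <- filter(flights, dest %in% c("RIC", "BOS"))
ric_or_bos
```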
These approaches should get you through most of your needs in filtering
datasets. Anything else can be done by making use of the operators & (and),
| (or), and ! (negation). Do not worry about these now; if you have a
need to use them on a project or lab I will show you then.
Graphing filtered data
If you want to use just the filtered data, this can be done straightforwardly
in R by specifying the correct dataset in the first parameter of the
ggplot command. But what if you want to use a subset of the data in only
one layer of the plot?
Within a geom layer we can override the data = option to use a different
dataset than the one specified in the first line. I recommend putting this at
the end of the geom layer:
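A sketch of this layering, assuming the ggplot2 and dplyr packages and a
made-up stand-in table (the column names dest, dep_delay and arr_delay are
assumptions modelled on the nycflights13 package, and the choice of a scatter
plot is only for illustration):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  dest      = c("RIC", "ATL", "RIC", "BOS", "ORD", "ATL"),
  dep_delay = c(5, 20, -3, 45, 0, 12),
  arr_delay = c(2, 31, -10, 50, -5, 9)
)

# The subset to highlight: flights to Richmond (airport code RIC)
flights_ric <- filter(flights, dest == "RIC")

# The first layer uses the full dataset given to ggplot();
# the second layer overrides it with data = flights_ric
p <- ggplot(flights, aes(dep_delay, arr_delay)) +
  geom_point() +
  geom_point(color = "red", data = flights_ric)
p
```

The second layer inherits its x and y mappings from the ggplot call, so only
the dataset and the color need to change.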
This shows all of the Richmond flights in red on top of the remainder of the
flights. Combined with annotations, these techniques can create very
professional-looking graphics.
To filter dates and date-time objects we can also use the numeric comparison
operators such as > and <. However, we have to convert the value we are
comparing against to a date object using either as.Date (for date-only data)
or as.POSIXct (for date-time data).
For example, here is a way of filtering the flights dataset by date:
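A sketch of both conversions, assuming dplyr's filter and a made-up stand-in
table (the date column is a hypothetical name; time_hour is an assumption
modelled on the nycflights13 package):

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  flight    = c(1545, 1714, 1141, 725),
  date      = as.Date(c("2013-01-01", "2013-03-15",
                        "2013-07-04", "2013-12-24")),
  time_hour = as.POSIXct(c("2013-01-01 05:00:00", "2013-03-15 17:00:00",
                           "2013-07-04 09:00:00", "2013-12-24 23:00:00"),
                         tz = "UTC")
)

# Date-only comparison: convert the comparison value with as.Date
after_july <- filter(flights, date > as.Date("2013-07-01"))

# Date-time comparison: convert the comparison value with as.POSIXct
before_mar15_5pm <- filter(
  flights,
  time_hour < as.POSIXct("2013-03-15 17:00:00", tz = "UTC")
)

after_july
before_mar15_5pm
```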
The special function between allows us to grab a range of dates:
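A sketch of a date range, assuming the between function from dplyr (which
includes both endpoints) and a made-up stand-in table with a hypothetical
date column:

```r
library(dplyr)

# Toy stand-in for the flights data; the values are invented
flights <- data.frame(
  flight = c(1545, 1714, 1141, 725),
  date   = as.Date(c("2013-01-01", "2013-03-15",
                     "2013-07-04", "2013-12-24"))
)

# Keep flights from 1 March through 31 July 2013, endpoints included
spring_summer <- filter(
  flights,
  between(date, as.Date("2013-03-01"), as.Date("2013-07-31"))
)
spring_summer
```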
We will, once again, work on a lab for the remainder of the class.
Upload your script to GitHub ahead of the next class.