# Class 10: Filtering Data

## Learning Objectives

• Create a new version of a dataset by taking a subset of the observations
• Apply binary operators such as >, >=, ==, and %in% to specify a filtering action
• Use the between and as.POSIXct to subset observations based on a date or date-time variable

## NYC Flights Data

Often is will be useful to take a subset of a dataset. This can be useful if we are only interested in a particular part of the dataset; it can also be used if we want to create one visualization layers that highlights where one subset of data lies within another.

To illustrate how this works, today we will explore a dataset of the every commercial flight that departed from New York City in 2013 (we’ll use this many times as it is a great teaching sample):

We will discuss a few different ways of taking a subset of the data, termed filtering, before showing how this approach can integrated into other analyses.

## Filtering data

The general syntax for filtering data in R is to use the following, where expression is a logical statement about variables in the dataset OLD:

For example, to find flights where the departure time is greater than 2300 (the times are in a 24 hour format):

Notice that the new dataset has only 2573 rows, much smaller than the starting dataset. Similar expressions exist for other numeric comparisons: < (less than), >= (greater than or equal), and <= (less than or equal). Similarly we can compare whether one variable is exactly equal to a particular value. For this we need to use ==, not a single equal sign:

Here we have flights that only take off exactly at 11pm. The symbol != detects whether a value is not equal to a particular value:

The == and != symbols also work for character and date variables, however you’ll need to make sure to enclose the comparison value (not the variable) in quotation marks:

We can detect whether a variable is equal to a set of values using the %in% and c functions:

These approaches here should get you through most of your needs in filtering datasets. Anything else can be gotten by making use of the & (and), | (or), and ! (negation). Do not worry about these now; if you have a need to use them on a project or lab I will show you then.

## Graphing filtered data

If you want to just use the filtered data, this can be done straightforwardly in R by simply specifying the correct dataset in the first parameter of the ggplot command. But what if you want to use a subset of the data in only one plot?

Within geom layer we can override the data =  option to use a different dataset than specified in the first line. I recommend putting this as the end of the geom layer:

This shows all of the Richmond flights in red on top of the remainder of the flights. Combined with annotations, these techniques can create very professional looking graphics.

## Filtering dates

To filter dates and date time objects we can also use the numeric comparison operators such as > and <. However, we have to convert the thing we are comparing to a date object using either as.Date (for just date data) or as.POSIXct (for date time data).

For example, here is a way of filtering the flights dataset using the time_hour variable:

The special function between allows us to grab a range of dates:

Or, similarly, as a date time object:

This will be useful if you want to subset the data you had from the first project.

## Using ggmap

As a final note today, let’s look at the ggmap package for ploting spatial data. First, read in the dataset of airports (I’ll also filter out some airports in strange places):

Now, if we want to plot the locations of all of the airports, read in the ggmap package:

To construct a map plot, use the following line in place of your usual call to ggplot:

Then, you can add points or text as usual:

I really like spatial data and think the idea behind ggmap is great. However, I’ll admit that the package is not very well written or maintained, so your mileage may vary.