## Learning Objectives

- Create a new version of a dataset by taking a subset of the observations
- Apply binary operators such as `>`, `>=`, `==`, and `%in%` to specify a filtering action
- Use `between` and `as.POSIXct` to subset observations based on a date or date-time variable

## NYC Flights Data

It is often useful to take a subset of a dataset. We may be interested in only a particular part of the data, or we may want to create a visualization layer that highlights where one subset of the data lies within another.

To illustrate how this works, today we will explore a dataset of every commercial flight that departed from New York City in 2013 (we’ll use this many times as it is a great teaching sample).

We will discuss a few different ways of taking a subset of the data, termed
**filtering**, before showing how this approach can be integrated into other
analyses.

## Filtering data

The general syntax for filtering data in R is to use the following, where
expression is a logical statement about variables in the dataset `OLD`:
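A minimal sketch of this pattern, assuming the `filter` function from the dplyr package. The three-row table and the names `NEW` and `score` are made up for illustration:

```r
library(dplyr)

# Hypothetical stand-in for OLD: any data frame works here
OLD <- data.frame(id = c(1, 2, 3), score = c(10, 55, 80))

# Keep only the rows where the logical expression is TRUE
NEW <- filter(OLD, score > 50)
NEW
```

The original dataset `OLD` is left unchanged; the subset is stored under the new name.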

For example, to find flights where the departure time is greater than 2300 (the times are in a 24-hour format):
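A sketch of this filter, assuming dplyr's `filter` and a departure-time column named `dep_time` (the name used in the nycflights13 version of this data). The four-row `flights` table below is a made-up stand-in so that the example runs on its own:

```r
library(dplyr)

# Made-up stand-in for the flights data; dep_time is in 24-hour HHMM format
flights <- data.frame(
  dep_time = c(517, 1430, 2300, 2345),
  dest     = c("IAH", "RIC", "BOS", "RIC")
)

# Keep only flights that departed after 11:00pm
flights_late <- filter(flights, dep_time > 2300)
nrow(flights_late)
```

With the full dataset in place of the stand-in, the same call returns the late-night departures discussed next.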

Notice that the new dataset has only 2573 rows, much smaller than the starting
dataset. Similar expressions exist for other numeric comparisons: `<` (less
than), `>=` (greater than or equal), and `<=` (less than or equal). Similarly,
we can compare whether one variable is exactly equal to a particular value.
For this we need to use `==`, not a single equal sign:
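These comparisons can be sketched as follows, again assuming dplyr's `filter`, a column named `dep_time`, and a made-up four-row stand-in for the flights data:

```r
library(dplyr)

# Made-up stand-in for the flights data
flights <- data.frame(dep_time = c(517, 1430, 2300, 2345))

early    <- filter(flights, dep_time < 600)    # strictly before 6:00am
at_least <- filter(flights, dep_time >= 2300)  # 11:00pm or later
at_most  <- filter(flights, dep_time <= 2300)  # 11:00pm or earlier
exactly  <- filter(flights, dep_time == 2300)  # exactly 11:00pm (double equals)
exactly
```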

Here we have only the flights that take off *exactly* at 11pm. The symbol `!=`
detects whether a value is **not** equal to a particular value:
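A short sketch of the not-equal comparison, under the same assumptions (dplyr's `filter`, a `dep_time` column, and a made-up stand-in table):

```r
library(dplyr)

# Made-up stand-in for the flights data
flights <- data.frame(dep_time = c(517, 1430, 2300, 2345))

# Keep every flight except those departing exactly at 11:00pm
not_2300 <- filter(flights, dep_time != 2300)
not_2300
```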

The `==` and `!=` symbols also work for character and date variables; however,
you’ll need to make sure to enclose the comparison value (not the variable) in
quotation marks:
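As an illustration, assuming a character column named `dest` that holds airport codes (as in nycflights13), with a made-up stand-in table:

```r
library(dplyr)

# Made-up stand-in for the flights data
flights <- data.frame(
  dep_time = c(517, 1430, 2300, 2345),
  dest     = c("IAH", "RIC", "BOS", "RIC")
)

# The value "RIC" is quoted; the variable name dest is not
to_richmond  <- filter(flights, dest == "RIC")
not_richmond <- filter(flights, dest != "RIC")
to_richmond
```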

We can detect whether a variable is equal to any one of a set of values using
the `%in%` operator together with the `c` function:
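A sketch of this, assuming the same hypothetical `dest` column and a made-up stand-in table:

```r
library(dplyr)

# Made-up stand-in for the flights data
flights <- data.frame(
  dep_time = c(517, 1430, 2300, 2345),
  dest     = c("IAH", "RIC", "BOS", "RIC")
)

# c() builds the set of values; %in% is TRUE when dest matches any of them
ric_or_bos <- filter(flights, dest %in% c("RIC", "BOS"))
ric_or_bos
```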

These approaches should get you through most of your needs in filtering
datasets. Anything else can be handled by combining expressions with `&` (and),
`|` (or), and `!` (negation). Do not worry about these now; if you have a
need to use them I will show you then.

## Graphing filtered data

If you want to use just the filtered data, this can be done straightforwardly
in R by specifying the correct dataset in the first parameter of the
`ggplot` command. But what if you want to use a subset of the data in only
one layer of the plot?

Within a `geom` layer we can override the `data = ` option to use a different
dataset than the one specified in the first line. I recommend putting this at
the end of the geom layer:
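A minimal sketch of this layering, assuming dplyr and ggplot2. The plotted variables `dep_delay` and `arr_delay` and the five-row stand-in table are chosen only for illustration:

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the flights data
flights <- data.frame(
  dest      = c("IAH", "RIC", "BOS", "RIC", "ORD"),
  dep_delay = c(2, 15, -3, 40, 7),
  arr_delay = c(11, 20, -10, 35, 0)
)
flights_ric <- filter(flights, dest == "RIC")

# The first line sets the default dataset for every layer;
# the second geom_point overrides it with data = at the end
p <- ggplot(flights, aes(dep_delay, arr_delay)) +
  geom_point() +
  geom_point(color = "red", data = flights_ric)
p
```

The second layer inherits the aesthetics from the first line, so only the dataset and the color change.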

This shows all of the Richmond flights in red on top of the remainder of the flights. Combined with annotations, these techniques can create very professional-looking graphics.

## Filtering dates

To filter dates and date time objects we can also use the numeric comparison
operators such as `>` and `<`. However, we have to convert the value we are
comparing against to a date object using either `as.Date` (for just date data) or
`as.POSIXct` (for date time data).

For example, here is a way of filtering the flights dataset using the
`time_hour` variable:
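One way this could look, assuming dplyr's `filter` and a date time column `time_hour`. The four-row stand-in table, the cutoff value, and the explicit `tz = "UTC"` are assumptions made so that the example behaves the same on any machine:

```r
library(dplyr)

# Made-up stand-in for the flights data, with a date time column
flights <- data.frame(
  time_hour = as.POSIXct(
    c("2013-02-28 23:00:00", "2013-03-01 05:00:00",
      "2013-03-03 12:00:00", "2013-03-06 08:00:00"),
    tz = "UTC"
  )
)

# Convert the comparison value to a date time object before comparing
cutoff <- as.POSIXct("2013-03-01 00:00:00", tz = "UTC")
after_cutoff <- filter(flights, time_hour > cutoff)
after_cutoff
```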

The special function `between` allows us to grab a range of dates:
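A sketch of a date range filter, assuming dplyr's `between` (which includes both endpoints). Converting `time_hour` to a date with `as.Date` first is a choice made here so that both sides of the comparison have the same type; the stand-in table and the range are made up:

```r
library(dplyr)

# Made-up stand-in for the flights data, with a date time column
flights <- data.frame(
  time_hour = as.POSIXct(
    c("2013-02-28 23:00:00", "2013-03-01 05:00:00",
      "2013-03-03 12:00:00", "2013-03-06 08:00:00"),
    tz = "UTC"
  )
)

# Keep flights whose calendar date falls from March 1 through March 5
in_range <- filter(
  flights,
  between(as.Date(time_hour, tz = "UTC"),
          as.Date("2013-03-01"), as.Date("2013-03-05"))
)
in_range
```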

Or, similarly, as a date time object:
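A sketch of the date time version, assuming dplyr's `between` with both endpoints built by `as.POSIXct`. The stand-in table, the endpoints, and the explicit `tz = "UTC"` are illustrative assumptions:

```r
library(dplyr)

# Made-up stand-in for the flights data, with a date time column
flights <- data.frame(
  time_hour = as.POSIXct(
    c("2013-02-28 23:00:00", "2013-03-01 05:00:00",
      "2013-03-03 12:00:00", "2013-03-06 08:00:00"),
    tz = "UTC"
  )
)

# Both endpoints are date time objects, so time_hour is compared directly
start <- as.POSIXct("2013-03-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2013-03-05 23:59:59", tz = "UTC")
in_window <- filter(flights, between(time_hour, start, end))
in_window
```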