Class 10: Filtering Data

Learning Objectives

  • Create a new version of a dataset by taking a subset of the observations
  • Apply binary operators such as >, >=, ==, and %in% to specify a filtering action
  • Use the between and as.POSIXct to subset observations based on a date or date-time variable

NYC Flights Data

Often is will be useful to take a subset of a dataset. This can be useful if we are only interested in a particular part of the dataset; it can also be used if we want to create one visualization layers that highlights where one subset of data lies within another.

To illustrate how this works, today we will explore a dataset of the every commercial flight that departed from New York City in 2013 (we’ll use this many times as it is a great teaching sample):

flights <- read_csv("https://statsmaths.github.io/stat_data/flights.csv")
flights
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 327,336 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

We will discuss a few different ways of taking a subset of the data, termed filtering, before showing how this approach can integrated into other analyses.

Filtering data

The general syntax for filtering data in R is to use the following, where expression is a logical statement about variables in the dataset OLD:

NEW <- filter(OLD, EXPRESSION)

For example, to find flights where the departure time is greater than 2300 (the times are in a 24 hour format):

flights_new <- filter(flights, dep_time > 2300)
flights_new
## # A tibble: 2,573 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1     2302           2200        62     2342
##  2  2013     1     1     2306           2245        21       28
##  3  2013     1     1     2307           2245        22       32
##  4  2013     1     1     2310           2255        15       24
##  5  2013     1     1     2312           2000       192       21
##  6  2013     1     1     2323           2200        83       22
##  7  2013     1     1     2326           2130       116      131
##  8  2013     1     1     2327           2250        37       32
##  9  2013     1     1     2343           1724       379      314
## 10  2013     1     1     2353           2359        -6      425
## # ... with 2,563 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

Notice that the new dataset has only 2573 rows, much smaller than the starting dataset. Similar expressions exist for other numeric comparisons: < (less than), >= (greater than or equal), and <= (less than or equal). Similarly we can compare whether one variable is exactly equal to a particular value. For this we need to use ==, not a single equal sign:

flights_exact <- filter(flights, dep_time == 2300)
flights_exact
## # A tibble: 64 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     5     2300           2229        31      352
##  2  2013     1     6     2300           2245        15       14
##  3  2013     1    13     2300           2039       141      357
##  4  2013     1    16     2300           2253         7       19
##  5  2013     1    30     2300           2030       150       23
##  6  2013    10     9     2300           2159        61       15
##  7  2013    10    18     2300           2245        15        7
##  8  2013    10    22     2300           2255         5       17
##  9  2013    11     8     2300           2245        15       16
## 10  2013    11    25     2300           2145        75      155
## # ... with 54 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

Here we have flights that only take off exactly at 11pm. The symbol != detects whether a value is not equal to a particular value:

flights_not_december <- filter(flights, month != 12)
flights_not_december
## # A tibble: 300,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 300,316 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

The == and != symbols also work for character and date variables, however you’ll need to make sure to enclose the comparison value (not the variable) in quotation marks:

flights_to_rva <- filter(flights, dest == "RIC")
flights_to_rva
## # A tibble: 2,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1     1505           1310       115     1638
##  2  2013     1     1     1859           1900        -1     2012
##  3  2013     1     1     1938           1703       155     2109
##  4  2013     1     1     2115           2110         5     2240
##  5  2013     1     2      859            845        14     1035
##  6  2013     1     2     1352           1310        42     1509
##  7  2013     1     2     1525           1525         0     1723
##  8  2013     1     2     1810           1703        67     1948
##  9  2013     1     2     1940           1900        40     2052
## 10  2013     1     2     2119           2110         9     2227
## # ... with 2,336 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

We can detect whether a variable is equal to a set of values using the %in% and c functions:

flights_summer <- filter(flights, month %in% c(7, 8, 9))
flights_dc_metro <- filter(flights, dest %in% c("DCA", "IAD", "BWI"))

These approaches here should get you through most of your needs in filtering datasets. Anything else can be gotten by making use of the & (and), | (or), and ! (negation). Do not worry about these now; if you have a need to use them on a project or lab I will show you then.

Graphing filtered data

If you want to just use the filtered data, this can be done straightforwardly in R by simply specifying the correct dataset in the first parameter of the ggplot command. But what if you want to use a subset of the data in only one plot?

Within geom layer we can override the data = option to use a different dataset than specified in the first line. I recommend putting this as the end of the geom layer:

ggplot(flights, aes(air_time, distance)) +
  geom_point() +
  geom_point(color = "red", data = flights_to_rva)

plot of chunk unnamed-chunk-9

This shows all of the Richmond flights in red on top of the remainder of the flights. Combined with annotations, these techniques can create very professional looking graphics.

Filtering dates

To filter dates and date time objects we can also use the numeric comparison operators such as > and <. However, we have to convert the thing we are comparing to a date object using either as.Date (for just date data) or as.POSIXct (for date time data).

For example, here is a way of filtering the flights dataset using the time_hour variable:

flights_post_oct <- filter(flights, time_hour >= as.POSIXct("2013-11-01"))
flights_post_oct
## # A tibble: 53,991 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013    11     1        5           2359         6      352
##  2  2013    11     1       35           2250       105      123
##  3  2013    11     1      455            500        -5      641
##  4  2013    11     1      539            545        -6      856
##  5  2013    11     1      542            545        -3      831
##  6  2013    11     1      549            600       -11      912
##  7  2013    11     1      550            600       -10      705
##  8  2013    11     1      554            600        -6      659
##  9  2013    11     1      554            600        -6      826
## 10  2013    11     1      554            600        -6      749
## # ... with 53,981 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

The special function between allows us to grab a range of dates:

flights_xmas <- filter(flights, between(time_hour,
                                        as.POSIXct("2013-12-25"),
                                        as.POSIXct("2013-12-26")))
flights_xmas
## # A tibble: 721 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013    12    25      456            500        -4      649
##  2  2013    12    25      524            515         9      805
##  3  2013    12    25      542            540         2      832
##  4  2013    12    25      546            550        -4     1022
##  5  2013    12    25      556            600        -4      730
##  6  2013    12    25      557            600        -3      743
##  7  2013    12    25      557            600        -3      818
##  8  2013    12    25      559            600        -1      855
##  9  2013    12    25      559            600        -1      849
## 10  2013    12    25      600            600         0      850
## # ... with 711 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <dttm>

Or, similarly, as a date time object:

flights_xmas_night <- filter(flights, between(time_hour,
                                              as.POSIXct("2013-12-25 18:00"),
                                              as.POSIXct("2013-12-25 23:59")))
flights_xmas_night
## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <int>    <int>
## 1  2013    12    25     2315           2330       -15      348
## 2  2013    12    25     2352           2359        -7      429
## 3  2013    12    25     2353           2355        -2      442
## 4  2013    12    25     2357           2359        -2      446
## 5  2013    12    25     2357           2359        -2      433
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <int>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   time_hour <dttm>

This will be useful if you want to subset the data you had from the first project.

Using ggmap

As a final note today, let’s look at the ggmap package for ploting spatial data. First, read in the dataset of airports (I’ll also filter out some airports in strange places):

airports <- read_csv("https://statsmaths.github.io/stat_data/f_airports.csv")
airports <- filter(airports, between(tz, -8, -5))
airports <- filter(airports, lat < 50)
airports
## # A tibble: 1,195 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <int> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044    -5 A     America/New_…
##  2 06A   Moton Field Municip…  32.5  -85.7   264    -6 A     America/Chic…
##  3 06C   Schaumburg Regional   42.0  -88.1   801    -6 A     America/Chic…
##  4 06N   Randall Airport       41.4  -74.4   523    -5 A     America/New_…
##  5 09J   Jekyll Island Airpo…  31.1  -81.4    11    -5 A     America/New_…
##  6 0A9   Elizabethton Munici…  36.4  -82.2  1593    -5 A     America/New_…
##  7 0G6   Williams County Air…  41.5  -84.5   730    -5 A     America/New_…
##  8 0G7   Finger Lakes Region…  42.9  -76.8   492    -5 A     America/New_…
##  9 0P2   Shoestring Aviation…  39.8  -76.6  1000    -5 U     America/New_…
## 10 0S9   Jefferson County In…  48.1 -123.    108    -8 A     America/Los_…
## # ... with 1,185 more rows

Now, if we want to plot the locations of all of the airports, read in the ggmap package:

library(ggmap)

To construct a map plot, use the following line in place of your usual call to ggplot:

qmplot(lon, lat, data = airports, geom="blank")

plot of chunk unnamed-chunk-15

Then, you can add points or text as usual:

qmplot(lon, lat, data = airports, geom="blank") +
  geom_point(color = "salmon", size = 0.5) 

plot of chunk unnamed-chunk-16

I really like spatial data and think the idea behind ggmap is great. However, I’ll admit that the package is not very well written or maintained, so your mileage may vary.