Today we are going to look at the NYC flights dataset once again:
Our object of study will be the function group_by, a dplyr
verb that seemingly does nothing (or very little) to a dataset.
Here, we group the flights data by month:
Other than a note in the output, the dataset is completely unchanged.
I like to think of the group_by function as putting a post-it note
on the dataset saying “treat unique combinations of the grouped variables
as their own data frames”. When passing the output of this function
to summarize, it will not return one summary row for each group.
We see that flights in June take off on average 20 minutes late, whereas
flights in November took off only a 5.4 minutes late.
We can group by multiple variables at once as well. Here we show the
average departure delay by airport and by month:
Notice that when grouping by multiple variables the summarize
function peels off the outer most layer of the grouping:
To remove all grouping, using the ungroup() function:
It is also possible to use grouped data with mutate and
filter. For example, here we return the 3 latest flights
for each day in the dataset:
Here, we see what proportion of the arrival delays is taken
up by each destination airport:
Yes, pipes with chained together grouped mutates and summaries
can get complex very quickly! Try to keep up with the basic ideas
and the more complicated examples will begin making sense soon.
When we print the flights dataset you should see that there are
short three or four letter abbreviations associated with each
variable. These describe the variable types of each column.
Specifically, we will see the following types in this course:
int for integers
dbl for doubles, or real numbers
chr for characters
dttm for date-times (a date + a time).
lgl for a logical value (TRUE or FALSE)
fctr for factors, a categorical variable from a fix dictionary of values
date for dates (without a time component)
We have already seen how to use logical statements to convert
numbers and characters into logical values. It is sometimes useful
to treat numeric data (int and dbl) as categorical data. Many
variables, such as month and day, are stored as numbers by can just
as easily be treated as discrete character values.
The factor function turns any variable into a factor
variable. We can use the mutate function to make a permanent
change, or apply it inline to effect the way a plot is built.
For example, compare this:
Alternatively, if a variable is truly continuous we can convert
it into a factor by using the cut function. This function splits
the range of the variable into evenly sized buckets and associates
each input with a specific value of the bucket. For example,
take the following example:
Of course, we would usually not color the points using the same
variable as one of the axes, but this helps illustrate what the
function is doing.
Later in the semester we will learn more functions specifically
for working with factors (from the forcats package) and
date-times (from the lubridate package).