Understand the concepts behind summarizing datasets by a grouping variable
and be able to apply this to a new dataset using the syntax in R.
NYC Flights Data
Once again we are going to work today with the NYC Flights dataset.
Take note of the unit of observation here: each row is a flight.
Changing the Unit of Observation
Often, it is useful to change the unit of observation within a dataset.
The most common way of doing this is to group the dataset by a combination
of variables and aggregate the numeric variables by taking sums, means, or
some other summary statistics. Some common examples include:
aggregating individual shot attempts in soccer to summary statistics about each player
aggregating census tract data to a county or state level
aggregating information about individual patients to summaries about demographic groups
We have seen a few simple ways of doing this already within a plot (such as
counting occurrences in a group with geom_bar). Today we will see how to do this
with the group_summarize command.
The group_summarize command comes from the smodels package. Applying it to a
dataset with no additional options yields a new dataset with just a single row.
Variables in the new dataset summarize the numeric variables in the raw data.
Specifically, we see the following summaries for each numeric variable (the new names add a suffix
to the original variable name):
mean: the average of all the observations
median: if we ordered all observations from smallest to largest, the middle value
sd: the standard deviation, a measurement of how much the number varies across observations (more on this after the break)
sum: the sum of all the observations
There is also a variable just called n at the end, giving the total number of observations in
the entire dataset.
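To make these summaries concrete, the sketch below computes the same statistics by hand in base R for one numeric variable. The flights_toy data frame and its values are made up for illustration, and the output column names (dep_delay_mean and so on) are an assumption about the suffix pattern described above, not the verified output of group_summarize.

```r
# Toy stand-in for the flights data; the values are made up for illustration.
flights_toy <- data.frame(
  month     = c(1, 1, 2, 2, 2, 3),
  dep_delay = c(10, -2, 5, 30, 0, 17)
)

# Summarize one numeric variable over the whole dataset. The result has a
# single row, and each new column adds a suffix to the original name.
overall <- data.frame(
  dep_delay_mean   = mean(flights_toy$dep_delay),
  dep_delay_median = median(flights_toy$dep_delay),
  dep_delay_sd     = sd(flights_toy$dep_delay),
  dep_delay_sum    = sum(flights_toy$dep_delay),
  n                = nrow(flights_toy)   # total number of observations
)
print(overall)
```

The unit of observation has changed: the six flights in the toy data collapse into one row describing the whole dataset.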
The magic of the group_summarize command comes from specifying other variables in the function call.
If we specify a grouping variable (here I’ll use month), the summarizing is done separately within
each month, so the new dataset has 12 rows, with each row summarizing a given month.
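The idea of summarizing within each group can be sketched in base R with the aggregate function. This is not the group_summarize implementation, just an illustration of the same operation on a made-up toy data frame (flights_toy, with hypothetical values). The toy data covers only three months, so the result has three rows rather than twelve.

```r
# Toy stand-in for the flights data; the values are made up for illustration.
flights_toy <- data.frame(
  month     = c(1, 1, 2, 2, 2, 3),
  dep_delay = c(10, -2, 5, 30, 0, 17)
)

# Mean departure delay within each month: one output row per month.
by_month <- aggregate(dep_delay ~ month, data = flights_toy, FUN = mean)
names(by_month)[names(by_month) == "dep_delay"] <- "dep_delay_mean"

# Number of flights within each month (same group order as above).
counts <- aggregate(dep_delay ~ month, data = flights_toy, FUN = length)
by_month$n <- counts$dep_delay

print(by_month)
```

Each row now describes a month rather than a flight, and the n column records how many flights went into each summary.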
This dataset can then be used in further visualizations, such as:
Notice that we could not produce this graphic directly from the raw data; the data must be summarized first.
Summarize By Multiple Variables
By supplying multiple variables to the group_summarize command, we can group by all of them simultaneously.
Here we have one row for each unique combination of carrier and departure airport:
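Grouping by two variables at once can be sketched the same way in base R. As before, this is an illustration on a made-up toy data frame (flights_toy, with hypothetical carriers, airports, and delays), not the output of group_summarize itself.

```r
# Toy stand-in for the flights data; the values are made up for illustration.
flights_toy <- data.frame(
  carrier   = c("AA", "UA", "AA", "AA", "UA", "UA"),
  origin    = c("JFK", "EWR", "JFK", "LGA", "EWR", "EWR"),
  dep_delay = c(10, -2, 5, 30, 0, 17)
)

# Mean departure delay for each (carrier, origin) pair. Only pairs that
# actually occur in the data get a row.
by_pair <- aggregate(dep_delay ~ carrier + origin, data = flights_toy, FUN = mean)
names(by_pair)[names(by_pair) == "dep_delay"] <- "dep_delay_mean"

print(by_pair)
```

Note that a carrier appears on several rows, once for each airport it departs from in the data; it is the pair of values that is unique to each row.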
This allows us to make plots like this:
As you can imagine, summarizing data can quickly allow for very complex
graphics and analyses.
We will work on Lab14 today in order to practice using the group_summarize command.
Please upload your script to GitHub ahead of the next class.