Class 14: Summarizing Data

Learning Objectives

Understand the concepts behind summarizing datasets by a grouping variable and be able to apply this to a new dataset using the syntax in R.

NYC Flights Data

Today we will once again work with the NYC Flights dataset.

Take note of the unit of observation here: each row is a flight.

Changing the Unit of Observation

Often, it is useful to change the unit of observation within a dataset. The most common way of doing this is to group the dataset by a combination of variables and aggregate the numeric variables by taking sums, means, or some other summary statistics. Some common examples include:

• aggregating individual shot attempts in soccer to summary statistics about each player
• aggregating census tract data to a county or state level
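As a minimal sketch of the first example, the base R code below collapses a small made-up table of shot attempts into one row per player. The data frame `shots` and its columns are hypothetical, and `aggregate` stands in here for the course's own tools.

```r
# Hypothetical toy data: each row is one shot attempt (values are made up).
shots <- data.frame(
  player   = c("Ada", "Ada", "Ada", "Ben", "Ben"),
  distance = c(10, 20, 30, 8, 12),
  goal     = c(1, 0, 0, 1, 1)
)

# Change the unit of observation from shot to player:
# one output row per player, averaging each numeric column.
players <- aggregate(cbind(distance, goal) ~ player, data = shots, FUN = mean)

# Add the number of shots each player took, matched by player name.
players$n <- as.vector(table(shots$player)[as.character(players$player)])

print(players)
```

Each row of `players` now describes a player rather than a single shot, which is the change in unit of observation described above.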

We have seen a few simple ways of doing this already within a plot (such as counting occurrences in a group with geom_bar). Today we will see how to do this with the group_summarize command.

Summarizing Data

The group_summarize command comes from the smodels package. Applying it to a dataset with no additional options yields a new dataset with just a single row. Variables in the new dataset summarize the numeric variables in the raw data.

Specifically, we see the following summaries for each numeric variable (the new names add a suffix to the original variable name):

• mean: the average of all the observations
• median: if we ordered all observations from smallest to largest, the middle value
• sd: the standard deviation, a measurement of how much the number varies across observations (more on this after the break)
• sum: the sum of all the observations

There is also a variable just called n at the end, giving the total number of observations in the entire dataset.
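To make these summaries concrete, here is a rough base R sketch that computes the same four statistics and the count n on a small made-up table. It is an illustration of the idea, not the smodels implementation. The data frame `toy` is hypothetical, and the exact output names (for example `dep_delay_mean`) are an assumption based on the suffix naming described above.

```r
# Hypothetical toy data standing in for the flights table (values are made up).
toy <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(2, 4, 6, 8),
  distance  = c(100, 200, 300, 400)
)

# Keep only the numeric columns; non-numeric ones are not summarized.
num <- toy[sapply(toy, is.numeric)]

# Apply each statistic to every numeric column.
stats <- list(mean = mean, median = median, sd = sd, sum = sum)
pieces <- lapply(names(stats), function(s) {
  out <- lapply(num, stats[[s]])
  names(out) <- paste0(names(num), "_", s)  # suffix naming is an assumption
  out
})

# Combine everything into a single-row data frame and append the row count n.
summary_row <- data.frame(do.call(c, pieces), n = nrow(toy))

print(summary_row)
```

The result has one row, with a column for every combination of numeric variable and statistic, plus n.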

Group Summarize

The magic of the group_summarize command comes from specifying other variables in the function call. If we specify a grouping variable (here I’ll use month), the summarizing is done within each month. So, here, the new dataset has 12 rows, with each row summarizing a given month:
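The sketch below shows the same grouping idea in base R on made-up data with only two months, so the result has two rows instead of twelve. The data frame `toy` is hypothetical and `aggregate` is a stand-in for the class command. The call shown in the comment is an assumption about how the class command would be written.

```r
# Hypothetical toy data: each row is a flight (values are made up).
toy <- data.frame(
  month     = c(1, 1, 2, 2, 2),
  dep_delay = c(10, 20, 0, 5, 10)
)

# With the class packages loaded, the grouped summary would presumably be
# written as something like: group_summarize(flights, month)

# Base R stand-in: one row per month, with the mean departure delay.
by_month <- aggregate(dep_delay ~ month, data = toy, FUN = mean)
names(by_month)[2] <- "dep_delay_mean"

# Number of flights in each month, matched by month value.
by_month$n <- as.vector(table(toy$month)[as.character(by_month$month)])

print(by_month)
```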

This dataset can then be used in further visualizations, such as:

Notice that it would be impossible to produce this graphic without first summarizing the data with the group_summarize command.

Summarize By Multiple Variables

By supplying multiple variables to the group_summarize command, we can group by all of them simultaneously. Here we have a unique row for each combination of carrier and departure airport:
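As a small illustration of grouping by two variables, the base R sketch below uses a made-up table with `carrier` and `origin` columns. The data frame `toy` and its values are hypothetical, and `aggregate` again stands in for the class command.

```r
# Hypothetical toy data: each row is a flight (values are made up).
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  origin    = c("JFK", "JFK", "LGA", "JFK", "JFK"),
  dep_delay = c(10, 20, 30, 0, 4)
)

# One row per combination of carrier and origin that occurs in the data.
by_both <- aggregate(dep_delay ~ carrier + origin, data = toy, FUN = mean)
names(by_both)[3] <- "dep_delay_mean"

print(by_both)
```

Only combinations that actually appear in the data get a row, so the toy table above yields three rows rather than four.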

Which allows us to make plots like this:

As you can imagine, summarizing data can quickly allow for very complex graphics and analyses.

Practice

We will work on Lab14 today in order to practice using the group_summarize command: lab14.Rmd. Please upload your script to GitHub ahead of the next class.