# Class 14: Summarizing Data

### Learning Objectives

Understand the concepts behind summarizing datasets by a grouping variable and be able to apply this to a new dataset using the syntax in R.

### NYC Flights Data

Once again we are going to work today with the NYC Flights dataset.

Take note of the unit of observation here: each row is a flight.
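To see the unit of observation directly, it helps to look at a few rows. The sketch below assumes the `flights` table from the `nycflights13` package as a stand-in for the class dataset; the version used in class may be loaded differently and may have different columns.

```r
library(dplyr)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# One row per flight: each line shows a single departure from NYC
glimpse(flights)

# The number of rows is therefore the number of flights
nrow(flights)
```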

### Changing the Unit of Observation

Often, it is useful to change the unit of observation within a dataset. The most common way of doing this is to group the dataset by a combination of variables and aggregate the numeric variables by taking sums, means, or some other summary statistics. Some common examples include:

• aggregating individual shot attempts in soccer to summary statistics about each player
• aggregating census tract data to a county or state level
• aggregating information about individual patients to summaries about demographic groups

We have already seen a few simple ways of doing this within a plot (such as counting occurrences in a group with geom_bar). Today we will see how to do this with the group_summarize command.
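As a reminder of the in-plot approach, here is a minimal sketch of counting within a plot. It assumes the `flights` table from the `nycflights13` package as a stand-in for the class dataset, and the choice of `carrier` as the grouping variable is only for illustration.

```r
library(ggplot2)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# geom_bar counts the rows in each group for us: one bar per carrier,
# with the bar height equal to the number of flights
p <- ggplot(flights, aes(x = carrier)) +
  geom_bar()

p
```

The counting here happens inside the plot, so the counts themselves are not available as a dataset for further work.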

### Summarizing Data

The group_summarize command comes from the smodels package. Applying it to a dataset with no additional options yields a new dataset with just a single row. The variables in the new dataset summarize the numeric variables in the raw data.

Specifically, we see the following summaries for each numeric variable (the new names add a suffix to the original variable name):

• mean: the average of all the observations
• median: if we ordered all observations from smallest to largest, the middle value
• sd: the standard deviation, a measurement of how much the values vary across observations (more on this after the break)
• sum: the sum of all the observations

There is also a variable just called n at the end, giving the total number of observations in the entire dataset.
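The summaries listed above can be sketched with plain dplyr for a single numeric variable. This is an approximation of what the command produces, not its implementation: the `flights` table from `nycflights13` is an assumed stand-in for the class data, `dep_delay` is chosen only for illustration, and `na.rm = TRUE` is an assumption needed because this copy of the data has missing delays.

```r
library(dplyr)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# With smodels installed, the call described in the text would be:
#   group_summarize(flights)

# Rough dplyr equivalent for one numeric variable (dep_delay).
# na.rm = TRUE is an assumption about how to treat missing values.
overall <- flights %>%
  summarize(
    dep_delay_mean   = mean(dep_delay, na.rm = TRUE),
    dep_delay_median = median(dep_delay, na.rm = TRUE),
    dep_delay_sd     = sd(dep_delay, na.rm = TRUE),
    dep_delay_sum    = sum(dep_delay, na.rm = TRUE),
    n                = n()   # total number of rows in the dataset
  )

overall
```

The result has a single row, and `n` equals the number of rows in the original data.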

### Group Summarize

The magic of the group_summarize command comes from specifying other variables in the function call. If we specify a grouping variable, such as month, the summarizing is done within each month. The new dataset then has 12 rows, with each row summarizing a given month.
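Grouping by month can be sketched with dplyr as follows. As before, the `flights` table from `nycflights13` is an assumed stand-in for the class data, only `dep_delay` is summarized to keep the output small, and `na.rm = TRUE` is an assumption about missing values.

```r
library(dplyr)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# With smodels installed, the call described in the text would be:
#   group_summarize(flights, month)

# Rough dplyr equivalent: the same summaries, computed within each month
by_month <- flights %>%
  group_by(month) %>%
  summarize(
    dep_delay_mean   = mean(dep_delay, na.rm = TRUE),
    dep_delay_median = median(dep_delay, na.rm = TRUE),
    n                = n()   # number of flights in that month
  )

by_month
```

The unit of observation has changed: each row is now a month rather than a flight, and the `n` column adds up to the original number of flights.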

This dataset can then be used in further visualizations, for example a plot with one point per month. Notice that such a graphic would be very difficult to produce from the raw, flight-level data without the summarize command.
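One possible plot of the monthly summary is sketched below. The original figure is not reproduced here, so the variable and geometry choices are assumptions; the data again come from the `nycflights13` stand-in.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# Summarize to one row per month (na.rm = TRUE is an assumption)
by_month <- flights %>%
  group_by(month) %>%
  summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE))

# Plot the 12 monthly averages as connected points
p <- ggplot(by_month, aes(x = month, y = dep_delay_mean)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12)

p
```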

### Summarize By Multiple Variables

By supplying multiple variables to the group_summarize command, we can group by all of them simultaneously. For example, grouping by carrier and departure airport gives a unique row for each combination of the two.
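A sketch of grouping by two variables at once, using dplyr on the assumed `nycflights13` stand-in. In that table the departure airport is stored in the `origin` column; the column name in the class data may differ.

```r
library(dplyr)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# With smodels installed, the call described in the text would be:
#   group_summarize(flights, carrier, origin)

# Rough dplyr equivalent: one row per (carrier, origin) combination
by_carrier_origin <- flights %>%
  group_by(carrier, origin) %>%
  summarize(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),  # na.rm is an assumption
    n              = n(),
    .groups        = "drop"   # return an ungrouped result
  )

by_carrier_origin
```

Only combinations that actually occur in the data get a row, so the result can have fewer rows than the number of carriers times the number of airports.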

Such a dataset allows us to make plots that compare combinations of carrier and airport. As you can imagine, summarizing data can quickly allow for very complex graphics and analyses.
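As one illustration, the two-variable summary can be drawn with carrier on the horizontal axis and airport as color. The original figure is not reproduced here, so this layout is an assumption, as is the use of the `nycflights13` stand-in data.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)  # assumption: stand-in for the class copy of the data

# One row per (carrier, origin) combination (na.rm = TRUE is an assumption)
by_carrier_origin <- flights %>%
  group_by(carrier, origin) %>%
  summarize(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    n              = n(),
    .groups        = "drop"
  )

# Point size shows how many flights stand behind each average
p <- ggplot(by_carrier_origin,
            aes(x = carrier, y = dep_delay_mean, color = origin, size = n)) +
  geom_point()

p
```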

### Practice

We will work on Lab14 today in order to practice using the group_summarize command: lab14.Rmd. Please upload your script to GitHub ahead of the next class.