### Learning Objectives

Understand the concepts behind summarizing datasets by a grouping variable and be able to apply this to a new dataset using the syntax in R.

### NYC Flights Data

Once again we are going to work today with the NYC Flights dataset.

Take note of the unit of observation here: each row is a flight.

### Changing the Unit of Observation

Often, it is useful to change the unit of observation within a dataset. The most common way of doing this is to group the dataset by a combination of variables and aggregate the numeric variables by taking sums, means, or some other summary statistics. Some common examples include:

- aggregating individual shot attempts in soccer to summary statistics about each player
- aggregating census tract data to a county or state level
- aggregating information about individual patients to summaries about demographic groups

We have seen a few simple ways of doing this already within a plot (such as
counting occurrences in a group with `geom_bar`). Today we will see how to do this
with the `group_summarize` command.

### Summarizing data

The `group_summarize` command comes from the **smodels** package. Applying it to a
dataset with no additional options yields a new dataset with just a single row.
Variables in the new dataset summarize the numeric variables in the raw data.

Specifically, we see the following summaries for each numeric variable (the new names add a suffix to the original variable name):

- `mean`: the average of all the observations
- `median`: if we ordered all observations from smallest to largest, the middle value
- `sd`: the standard deviation, a measurement of how much the number varies across observations (more on this after the break)
- `sum`: the sum of all the observations

There is also a variable just called `n` at the end, giving the total number of observations in
the entire dataset.
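To make these summaries concrete, here is a minimal sketch that computes them by hand. It does not call `group_summarize` itself; it assumes the **dplyr** package as a stand-in, and `flights_toy` is a made-up five-row table (not the real flights data) with names chosen to mimic the suffix convention described above.

```r
library(dplyr)

# Made-up toy data for illustration only; not the real NYC Flights dataset
flights_toy <- tibble(
  month     = c(1, 1, 2, 2, 3),
  dep_delay = c(10, 20, 0, 40, 5)
)

# Collapse the whole table into a single row of summaries
overall <- flights_toy %>%
  summarize(
    dep_delay_mean   = mean(dep_delay),
    dep_delay_median = median(dep_delay),
    dep_delay_sd     = sd(dep_delay),
    dep_delay_sum    = sum(dep_delay),
    n                = n()          # number of rows summarized
  )

overall
```

The result has one row: the unit of observation is no longer a flight but the dataset as a whole.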

### Group Summarize

The magic of the `group_summarize` command comes from specifying other variables in the function call.
If we specify a grouping variable, here I’ll use `month`, the summarizing will be done *within*
each month. So, here, the new dataset has 12 rows, with each row summarizing a given month.
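The idea of summarizing within groups can be sketched as follows. Again this is an assumption-laden stand-in: it uses **dplyr**'s `group_by` and `summarize` rather than `group_summarize`, and the toy table (made-up values) only covers three months, so the result has three rows instead of twelve.

```r
library(dplyr)

# Made-up toy data for illustration only; not the real NYC Flights dataset
flights_toy <- tibble(
  month     = c(1, 1, 2, 2, 3),
  dep_delay = c(10, 20, 0, 40, 5)
)

# One output row per month; summaries are computed within each month
by_month <- flights_toy %>%
  group_by(month) %>%
  summarize(
    dep_delay_mean = mean(dep_delay),
    dep_delay_sum  = sum(dep_delay),
    n              = n()            # flights in that month
  )

by_month
```

Each row of `by_month` now describes a month rather than a flight, and `n` counts the flights that fell into that month.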

This dataset can then be used in further visualizations.
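As one possible illustration, the sketch below plots a per-month average from a summarized table. It assumes **dplyr** and **ggplot2**, uses made-up toy values, and the choice of a bar chart of mean departure delay is mine, not necessarily the graphic used in class.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data for illustration only; not the real NYC Flights dataset
flights_toy <- tibble(
  month     = c(1, 1, 2, 2, 3),
  dep_delay = c(10, 20, 0, 40, 5)
)

# Summarize first: one row per month
by_month <- flights_toy %>%
  group_by(month) %>%
  summarize(dep_delay_mean = mean(dep_delay))

# Then plot the summarized table: one bar per month
p <- ggplot(by_month, aes(x = factor(month), y = dep_delay_mean)) +
  geom_col() +
  labs(x = "Month", y = "Mean departure delay")

p
```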

Notice that a graphic like this depends on summarizing the data first: the raw data has one row per flight, not one row per month.

### Summarize By Multiple Variables

By supplying multiple variables to the `group_summarize` command, we can group by all of them simultaneously.
Here we have a unique row for each combination of carrier and departure airport.
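Grouping by two variables at once can be sketched like this. As before, **dplyr** stands in for `group_summarize`, and the table is made-up toy data; the column names `carrier` and `origin` are assumed names for the carrier and departure airport.

```r
library(dplyr)

# Made-up toy data for illustration only; not the real NYC Flights dataset
flights_toy <- tibble(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  origin    = c("JFK", "JFK", "LGA", "EWR", "EWR"),
  dep_delay = c(10, 30, 5, 0, 20)
)

# One output row per (carrier, origin) combination that occurs in the data
by_carrier_origin <- flights_toy %>%
  group_by(carrier, origin) %>%
  summarize(
    dep_delay_mean = mean(dep_delay),
    n              = n(),
    .groups        = "drop"         # return an ungrouped table
  )

by_carrier_origin
```

Only combinations that actually appear get a row, so a carrier that never departs from a given airport simply has no row for it.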

This allows us to make plots that compare each combination of carrier and departure airport.

As you can imagine, summarizing data can quickly allow for very complex graphics and analyses.

### Practice

We will work on Lab14 today in order to practice using the `group_summarize` command: lab14.Rmd

Please upload your script to GitHub ahead of the next class.