# Class 18: One Table Verbs

## Transforming Data

### Verbs

Today we are going to cover a set of functions that take a data frame as an input and return a new version of the data frame. These functions are called verbs and come from the dplyr package. If you are familiar with running database queries, note that all of these verbs map onto SQL commands. In fact, R can be set up so that dplyr is called over a database rather than a local data frame in memory.

There are over 30 verbs within dplyr, though most are either a minor variant or a specific application of another verb. Today we will see just five of them, which do the following:

• select a subset of rows from the original dataset (filter)
• rearrange the rows of the input (arrange)
• pick a subset of the variables from the original (select)
• create new variables (mutate)
• collapse rows into a single summary (summarize)

In the case of all verbs, the first argument is the original data frame and the output is a new data frame. It is important to note that verbs do not modify the original data; they operate on a copy of the original data.
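
A quick way to convince yourself of this is to apply a verb and then check the original. The sketch below uses a tiny made-up data frame; the names `scores` and `high` are hypothetical and not part of the course data:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
scores <- data.frame(
  name  = c("a", "b", "c"),
  score = c(90, 72, 85)
)

# The verb returns a new data frame...
high <- filter(scores, score > 80)

# ...while the original keeps all of its rows
nrow(scores)
nrow(high)
```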

To illustrate these verbs we will work with a dataset of every commercial flight that departed from New York City in 2013.

### Filtering rows

The filter function takes a dataset and returns a subset of the rows of the original data. The first argument is the dataset and the remaining arguments are logical statements that filter the data frame. Only rows where the statements are true will be returned. Yes, we’ve already seen this.

Let’s grab only those flights that left after 11pm (2300):
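
Since the flights data is not loaded in these notes, here is a minimal sketch on a small made-up data frame. The column names `carrier`, `origin`, and `dep_time` are assumptions borrowed from the `flights` table in the nycflights13 package, and the values are invented:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier  = c("UA", "B6", "AA", "DL", "B6"),
  origin   = c("EWR", "JFK", "JFK", "LGA", "JFK"),
  dep_time = c(517, 2305, 2359, 1430, 2312)
)

# Keep only rows where the departure time is later than 2300
late_flights <- filter(flights, dep_time > 2300)
late_flights
```

Several conditions can be supplied as separate arguments, for example `filter(flights, dep_time > 2300, origin == "JFK")`; a row is kept only if all of them are true.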

### Arranging rows

Next, we will see how to reorder the rows in a data frame using the arrange function. Like all verbs it takes the data frame as its first argument. Other arguments specify which variables to sort by:
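
As a minimal sketch, assuming a departure-time column named `dep_time` (the toy values below are invented), sorting by a single variable might look like this:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier  = c("UA", "B6", "AA", "DL", "B6"),
  dep_time = c(517, 2305, 2359, 1430, 2312)
)

# Sort rows from the earliest to the latest departure time
by_time <- arrange(flights, dep_time)
by_time
```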

Other variables break ties in earlier variables:
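
For instance, in the sketch below (toy data with assumed column names `origin` and `dep_delay`), the three JFK rows tie on `origin` and are then ordered by `dep_delay`:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  origin    = c("EWR", "JFK", "JFK", "LGA", "JFK"),
  dep_delay = c(2, 15, 0, -3, 27)
)

# Sort by origin first; ties within an origin are ordered by dep_delay
by_origin <- arrange(flights, origin, dep_delay)
by_origin
```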

To sort a variable in reverse order, wrap it in the desc function:
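
A short sketch on toy data (the `dep_delay` column name is an assumption) that puts the most delayed flights first:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier   = c("UA", "B6", "AA", "DL", "B6"),
  dep_delay = c(2, 15, 0, -3, 27)
)

# desc() reverses the sort order, so the largest delays come first
most_delayed <- arrange(flights, desc(dep_delay))
most_delayed
```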

You can store a dataset after arranging it, but the most useful application is to print out the results in order to look at them.

### Selecting columns

With larger datasets (or when producing reports) it is sometimes useful to select just a subset of the columns in the original dataset. To do this, we use the select function. The first input is the dataset, with other arguments being the variables we want to look at:
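
Here is a minimal sketch, again on an invented toy data frame whose column names are assumptions, that keeps three of the five columns:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier   = c("UA", "B6", "AA"),
  origin    = c("EWR", "JFK", "JFK"),
  dep_time  = c(517, 2305, 2359),
  dep_delay = c(2, 15, 0),
  arr_delay = c(11, 5, -8)
)

# Keep only the named columns, in the order they are listed
delays <- select(flights, carrier, dep_delay, arr_delay)
delays
```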

I now use select often throughout our class notes so that the variables of interest (which tend to be at the end, as we add them) are not hidden when a dataset is printed.

### Adding and modifying variables

The mutate function creates new variables in our dataset as a function of other variables already present in the data. The function always adds variables at the end of the dataset, so in order to see the results we will work with a smaller subset of the columns.

Let’s calculate the average speed of each flight. The mutate function takes the data frame as its first argument followed by named arguments describing the new variables:
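
The sketch below first narrows a toy data frame to a few columns and then adds a speed variable. It assumes `distance` is in miles and `air_time` is in minutes (as in the nycflights13 `flights` table); the values and the names `flights_sml` and `speed` are made up for illustration:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier  = c("UA", "B6", "DL"),
  origin   = c("EWR", "JFK", "LGA"),
  distance = c(1400, 1089, 762),  # assumed to be miles
  air_time = c(210, 165, 127)     # assumed to be minutes
)

# Work with a smaller set of columns so the new variable is easy to see
flights_sml <- select(flights, carrier, distance, air_time)

# Average speed in miles per hour; the new column is added at the end
flights_speed <- mutate(flights_sml, speed = distance / air_time * 60)
flights_speed
```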

Similarly, we can figure out how much time was lost or gained between the departure delay and arrival delay:
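
One possible version is sketched below on toy data. The sign convention (arrival delay minus departure delay, so positive values mean time was lost in the air) and the name `gain` are assumptions made for this example:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier   = c("UA", "B6", "AA", "DL"),
  dep_delay = c(2, 15, 0, -3),
  arr_delay = c(11, 5, -8, -10)
)

# Positive: the flight lost time en route; negative: it made time up
flights_gain <- mutate(flights, gain = arr_delay - dep_delay)
flights_gain
```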

Note that you can overwrite variables that already exist with mutate as well, though in general this should be avoided.

### Summarizing data

The summarize function collapses a data frame into a single-row summary. We need to specify exactly which summaries are performed. Here, we will grab the mean values for arrival and departure delays:
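
A minimal sketch on toy data follows (column names assumed, values invented). The `na.rm = TRUE` argument is included on the assumption that real flight data may contain missing delays, for example for cancelled flights:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  dep_delay = c(2, 15, 0, -3, 27),
  arr_delay = c(11, 5, -8, -10, 30)
)

# One row out, with one column per named summary
delay_means <- summarize(
  flights,
  mean_arr_delay = mean(arr_delay, na.rm = TRUE),
  mean_dep_delay = mean(dep_delay, na.rm = TRUE)
)
delay_means
```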

There is also a special function called n() that summarizes the total number of rows:
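
As a short sketch on the same kind of toy data (invented values; the output column name `n_flights` is hypothetical):

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  carrier   = c("UA", "B6", "AA", "DL", "B6"),
  dep_delay = c(2, 15, 0, -3, 27)
)

# n() takes no arguments and counts the rows being summarized
flight_count <- summarize(flights, n_flights = n())
flight_count
```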

Other summary functions that you might find useful:

• min
• max
• median
• sd - standard deviation
• quantile(x, 0.25) - generalization of median; here, a value that is greater than 25% of the data
• first, last, nth
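
As a sketch of how several of these can be combined in a single call, here they are applied to one toy column of invented delays (the output column names are hypothetical):

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column name
flights <- data.frame(
  dep_delay = c(2, 15, 0, -3, 27)
)

delay_stats <- summarize(
  flights,
  shortest = min(dep_delay),
  longest  = max(dep_delay),
  middle   = median(dep_delay),
  spread   = sd(dep_delay),
  q25      = quantile(dep_delay, 0.25),  # first quartile
  first_d  = first(dep_delay),           # value in the first row
  second_d = nth(dep_delay, 2),          # value in the second row
  last_d   = last(dep_delay)             # value in the last row
)
delay_stats
```

Note that first, last, and nth depend on the current row order, so they are most meaningful after an arrange.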

Summarizing datasets does not seem particularly useful here, as we have other ways of computing the means and counts of a dataset without using a new function. The real power of the summarize function comes when we learn how to group datasets next class.

## Pipes

The pipe operator %>% is a relative newcomer within the R ecosystem. It is incredibly useful for writing readable code with dplyr and ggplot2. The pipe passes the output of one function to the first argument of the next function. Because ggplot and all of the dplyr verbs take a data frame as their first input, we can pipe together a number of operations without saving the intermediate results.

For example, we can compute the average change in delay between departure and arrival for flights leaving from JFK, and we can also save the results of a long piped set of commands as a new dataset:
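
Both uses are sketched below on a toy data frame. The column names are assumptions, the values are invented, and `gain`, `avg_gain`, and `jfk_gain` are hypothetical names chosen for this example:

```r
library(dplyr)

# Toy stand-in for the flights data: invented values, assumed column names
flights <- data.frame(
  origin    = c("EWR", "JFK", "JFK", "LGA", "JFK"),
  dep_delay = c(2, 15, 0, -3, 27),
  arr_delay = c(11, 5, -8, -10, 30)
)

# Print the result directly: each step feeds the next one's first argument
flights %>%
  filter(origin == "JFK") %>%
  mutate(gain = arr_delay - dep_delay) %>%
  summarize(avg_gain = mean(gain))

# Or save the result of the whole chain as a new dataset
jfk_gain <- flights %>%
  filter(origin == "JFK") %>%
  mutate(gain = arr_delay - dep_delay) %>%
  summarize(avg_gain = mean(gain))
```

The piped version is equivalent to nesting the calls as `summarize(mutate(filter(flights, ...), ...), ...)`, but it reads top to bottom in the order the steps happen.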

Notice the standard syntax of the piped commands: each line after the first is indented and we usually pipe the data itself as the first line. With a ggplot2 command, subsequent rows are indented twice.

## Resources

Here are several good resources if you want to learn more about the dplyr package:

Of course, you can also ask me any questions you may have!