# Class 19: Collecting Relational Data

In order to let you get started on the final project, I have reordered the typical order of my notes on data manipulation. Today we are going to see how to collect and manipulate data stored in multiple tables.

## Three principles

We have already discussed at length how to store tablular datasets and some good practices for naming variables. As datasets become more complex, there are some additional concerns that arise. All of the principles of constructing a dataset (equivalently, a database) could easily fill a whole course. Here are three principles that get us on the right track:

• determine the objects of study; each of these gets its own table, and each example gets its own row; movies, actors, actor-movie links
• information about an object of study should be put on the relevant table; the information should remain internally consistent regardless of changes made within a data table
• each column should be indivisible and the variable type clear; for example, budget should not include the dollar sign, if needed create a new column; name columns with no spaces or special characters

If you would like to know about specifics of these principals I highly recommend the Hadley Wickhams “Tidy Data” paper:

The Tidy Data chapter of R for Data Science covers several useful functions the show to convert a dataset that has already been collected into a tidy format:

To better explain these principles, let’s look at several tables from the airline dataset.

## Airline data

Today we will once again look at the NYC flights dataset:

This time we will also look at another table that describes each of the airlines.

Another describing the planes:

The airports:

And the weather:

Each of these tables captures a specific unit of observation. Information about a unit of observation is only included in a specific table. Note that, for example, including the origin airport in the flights table is not an error: this is information about the flight. However, if we put in information about the timezone of the flight, this would break the tidy data principles.

## Two table verbs

Having data places across different tables is useful, but often we need to put the data back together before using it. Notice that the various tables are connected by way of common shared variables, known in database theory as keys. A key may be a single variable or a collection of variables (a composite key).

In order to combine them, by matching up along keys, we can make use of the function left_join. Here we see that it returns a new table with the same number of rows as the flights dataset but now with the full name of the carrier. I’ll make use of a pipe to reduce the number of variables to make it easier to see:

The resulting dataset combines all of the variables by the common key.

### Composite keys

Recall that a key may consist of multiple variables. To join on many variables at once, provide a vector of all the names to the option by:

### Common variables

Sometimes there may be a variable name present in two datasets that we want to merge together but that has a different meaning in each dataset. For example, year is the year of the flight in the flights dataset but the year of creating in the airplanes dataset. If we join these two, we see that a suffix is added to each variable:

This behavior is fine, but it is better to manually describe what the suffix should be:

### Unspecified key

If we do not specify the key to join with in the by option, dplyr will assume that we want to join on all common keys. A warning will be produced warning which variables were choosen. This can be useful in a pinch, but generally is a bad idea to rely on.

### Practice (or not)

In the interest of time, we will practice the left_join function on the next lab. We will spend the end of class today discussing the final project.