Class 06: Tidy Data

Learning Objectives

• Identify the unit of observation of a data set
• Describe the ‘tidy data’ format
• Collect data as a structured tidy dataset
• Use left_join to combine linked tables in a tidy data schema

Unit of Observation

By now you have had several opportunities to practice building a structured tabular dataset. Today we will see one more extension of the methods we have so far developed. This change will allow us to accurately store and describe nearly any structured dataset. Before we get there, it will be useful to define a concept called the unit of observation.

In short, the unit of observation describes what type of element is described by a single row of the data. Here is a dataset that records specific commercial flights from one of the three NYC airports in 2013:

The unit of observation here is a flight. Likewise, here is a dataset describing information about specific planes:

The unit of observation of this dataset is an individual plane. Or, in this dataset:

The unit of observation is an airport.

Tidy data

Ideally, a data table should only contain information about its unit of observation. So, for example, the flights dataset above should not (and does not) contain metadata about the plane, destination airport, or arrival airport. It does indicate which plane was flying, where it started, and where it landed. These are okay because these values are about the flight.

A structured dataset that adheres to this principal is called “tidy data”. Benefits of this approach include:

• no repeated information
• internal consistency
• easy to modify

Take the flights dataset and the location of each airport. If these values were stored in the flight dataset the locations would be repeated each time there is a flight. This is cumbersome to enter and might lead to someone accidentally inputting the wrong value on some rows. Also, if we later determined that a particular airport’s exact coordinates have changed it would be difficult to change the values in the table. Instead, storing the location in its own table required only storing the value once and making it easy to modify and impossible to be inconsistent.

Course schedule dataset

Now, in groups, you will open a spreadsheet and collect data about the classes you are taking this semester. Include:

• name of the student
• student’s primary major (or intended primary major if undeclared)
• does the class fulfil major requirements for the student
• course number
• course name
• course meeting building abbreviation

The twist is that you need to devise a way to collect this data in a tidy format.

Combining tidy data

When you create a tidy dataset, typically the data is stored in multiple tables. In order to do analysis with the dataset we need a way of combining the tables back together.

The diagram below shows all of the connections between tables in the NYC Flight datasets (as well as two other tables I did not initially show):

Typically a relation maps a single row in one dataset to many rows in another. For example, each flight has one origin, but each origin has many flights. To combine tables that share a relation, we can use the function left_join in R.

Typically we put the larger table first and the smaller (metadata) table second.

Project I

You now have all the tools you need to take on Project I. See the main course page for instructions. We will work on the project in class during our next meeting.