Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Collecting Data


It's a well-known aphorism within data science that the majority of an analysis consists of cleaning and validating our data. If we can collect our data in a clean format from the start, we can proceed directly to the exploration stage once the data have been collected.

There are a number of excellent articles that give an extensive overview of how to collect and organize data. Hadley Wickham’s “Tidy Data”, one of the most cited papers across all of data science, offers an extensive theoretical framework for describing a process for collecting data sets. Karl Broman and Kara Woo’s “Data Organization in Spreadsheets” offers a balance between practical advice and an extended discussion of general principles for collecting data sets. The Data Carpentry guide “Data Organization in Spreadsheets” provides a concise list of common pitfalls.

This notebook provides a summarized set of advice for organizing and storing data within a spreadsheet program. Rather than an extensive discussion of various pros and cons, it primarily focuses on the explicit approaches that I recommend. For readers interested in broader coverage, I suggest reading the articles cited above. Because we are not using any fancy spreadsheet functions here, any program that you would like to use should be fine. The screenshots come from Microsoft Excel, but the same approach will work in Google Sheets, LibreOffice, or another spreadsheet program of your choosing.

Rectangular data

We have frequently discussed the concept of a rectangular data set, with observations in rows and variables in columns. This is the same format that we will use to collect our data. The first thing you will need to do, then, is determine what things you are observing and what properties you would like to collect about each thing. If you are observing different kinds of things, each of which has a different set of associated properties, you may need to store each set in a different table.
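
As a rough illustration of storing different kinds of things in different tables, here is a minimal sketch in base R. The table names (`students`, `classes`), the column names, and the values are all made up for this example and are not part of the data we will load in this notebook. Each table has one row per observation and one column per variable, and a shared key column (`class_id`, also hypothetical) lets us combine the two when we need to.

```r
# Hypothetical example data: one table for each kind of thing observed.
# Each row is one observation; each column is one variable.
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Ana", "Ben", "Chloe"),
  class_id = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Properties of a class live in their own table, one row per class,
# rather than being repeated on every student row.
classes <- data.frame(
  class_id = c("A", "B"),
  room = c("101", "204"),
  stringsAsFactors = FALSE
)

# The shared key column (class_id) lets us join the tables when needed.
combined <- merge(students, classes, by = "class_id")
print(combined)
```

Keeping the class information in its own table means that a room number is typed in only once, so there is only one place where it can be mistyped or go out of date.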

To match the format of rectangular data that we have been working with in R, we need to structure our data set with a single row of column names, followed by a row of data for each observation. Here is an example from Excel of a nonsense data set with three variables: