- Describe the two components of tabular data (observations and variables)
- Identify a comma seperated values (CSV) file
- Create tabular data in a spreadsheet and export to CSV
- Read a local CSV file into R with
Tabular data formats
In this course we will store data in a tabular format. These tables will have observations stored in rows and variables stored in columns. The individual elements are called values. So, each row represents a particular object in our dataset and each column represents some feature of the objects.
Let’s look at the births dataset again from last time:
The observations here are years and the variables are:
boy_to_girl_ratio. Each variable
measures something about a given observation. What exactly
constitutes a row of the data is called a unit of analysis.
Keeping in mind what the unit of analysis is will be very
important as we think about how data is being used.
Comma separated values
The type of R object that stores such a dataset is called a data frame. Data frames store tabular data for us within R. We also need a way to store tabular data as a file. One option would be to store data as Google sheets or Excel file. While these programs are great for data input, it is not a good idea to store raw data in these formats. Instead, we want a minimal plain text format. That is, the file should be readable by any basic text editor.
The plan text data format that we will use is called a comma-separated values or CSV file. In this format, each column of the data is (as the name suggests) separated by a comma. Every observation is stored on its own row. The first row gives the names of the columns. Here are what the first few rows of the births dataset look like stored as a CSV file:
head_of_state,year,boys,girls,total,boy_to_girl_ratio Charles I,1629,5218,4683,9901,1.114 Charles I,1630,4858,4457,9315,1.09 Charles I,1631,4422,4102,8524,1.078 Charles I,1632,4994,4590,9584,1.088 Charles I,1633,5158,4839,9997,1.066 Charles I,1634,5035,4820,9855,1.045
A CSV file can be created and read by R, Excel, GoogeSheets, OpenOffice, and most other data processing and statistical tools. It is one of the most common formats that you will see data stored as online. By convention, a CSV file has the extension “.csv”.
As we have seen, we can read a dataset using the
read_csv function. The
function either takes a URL, as we have above, or a path to the file on your
computer. We will test out the second example now.
Activity: Data creation
We are now going to collect some data as a class. Specifically, you will each record information about your six favorite restaurants:
- name of the restaurant
- favorite dish
- cost of a typical meal per person
- how many times you visit each year
- last time that you visited
Let’s start by doing this individually in Google Sheets (I’ll explain these steps in person). Once you are done, download the dataset as a CSV file.
Reading in a local file
Once you have downloaded the file, rename it to
place it wherever you save your RMarkdown files. Then, download the fourth
Save the lab in the same location and open the file in RStudio. We will work on the first few questions together.
Combining the data
It will be much more interesting if we can combine the data from everyone in the class. Use the links here to link into the collective Google Sheet and add your data to the appropriate sheet:
Once we are finished with the, you will then download the entire class’ data
as a single file. Download it and name the file
it into R as a dataset called
Depending on how the class goes, we may find that there are inconsistencies in how the data is formatted from student to student. Of most importance is that if any value in a variable does not look like a number the entire column will be considered as a categorical variable. With the time remaining we will try to adjust this.
Next week we will dig deeper in to the specific types of data in R and how the effect our graphics and analyses.
New variables (time remaining)
If we have time remaining, we will add additional variables to our dataset such as the latitude and longitude of the restaurant.