Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Restructuring Data

Motivation

In this notebook we introduce another set of methods for manipulating data sets. Table pivots, which are related by not identical to the spreadsheet concept of a pivot table, are a way or rearranging the values in a table without adding or lossing any additional information. This is achieved by either making the table longer (more rows, fewer columns) or wider (more columns, fewer rows).

What sort of situations would require going between two different formats with a different number of rows? As an example, consider a hypothetical experiment where we measure the heights of 100 plants every morning for two weeks. There are two different ways to store this data. We could have 100 rows, one for each plant, with variables height_day1, height_day2, and so on all the way through height_day14. Alternatively, we could have 1400 rows with just three columns: an id for the plant, a variable for the day, and a variable for height. Notice that both of these options capture the same information, but each privileges a particular kind of analysis.

In the wider table format, it is straightforward to compute the amount that each plot grew over the two weeks using a single mutate function. In the longer table format, it would be straightforward to filter by a specific plant id and draw a line plot showing the growth of a specific plant over the two week period. Both drawing a plot with the wider table or computing the growth with the longer table are possible, but require a surprising amount of work and code.

In this notebook we will introduce two new functions for alternating between wider and longer formats for a data set. These are principles that will be fundamental to several applications, particularly with text and temporal data sets.

Pivot wider

For these notes, we will make use of the food_prices data set. As shown in our motivation example, pivoting is often a useful operation to apply when analyzing data collected over time. The data set is organized with year as the observation and each food type as a column.

food_prices
## # A tibble: 146 x 14
##     year   tea sugar peanuts coffee cocoa wheat   rye  rice  corn barley  pork
##    <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1  1870  129.  151.    203.   88.1  78.8  88.1 103.   83.5 121.    103. 108. 
##  2  1871  132.  167.    222.  109.   66.7 118.  105.   84.5  88.4   130.  68.5
##  3  1872  134.  162.    189.  140.   71.6 122.  102.   92.9  69.2   125.  58.6
##  4  1873  136.  154.    179.  173.   65.8 116.  106.   91.0  67.1   166.  62.9
##  5  1874  146.  153.    231.  187.   69.9 113.  126.   99.6 128.    174.  87.6
##  6  1875  149.  150.    197.  176.   69.4 110.  116.   85.8 127.    161. 113. 
##  7  1876  150.  160.    172.  184.   80.7 114.  106.   95.3  91.2   132. 106. 
##  8  1877  149.  189.    153.  198.   87.8 144.   97.0 108.   94.5   125.  84.0
##  9  1878  150.  165.    160.  169.   96.0 115.   91.6 114.   82.2   121.  60.3
## 10  1879  144.  158.    133.  149.  108.  118.  113.  110.   78.7   124.  61.1
## # … with 136 more rows, and 2 more variables: beef <dbl>, lamb <dbl>

This format makes it straightforward to compute the correlation between the prices of different kinds of food items. A longer format for the data set would, instead, have one row for each combination of year and food time.

In order to make this table longer, we will apply the pivot_longer function. This function requires knowing which current variables in the data set should be turned into values in the output data set. Often, it is easier to describe the set of values that will not be turned into values. Here, we indicate (with a minus sign) that the year value should remain as a variable in the output data set:

food_prices %>%
  pivot_longer(-c(year))
## # A tibble: 1,898 x 3
##     year name    value
##    <dbl> <chr>   <dbl>
##  1  1870 tea     129. 
##  2  1870 sugar   151. 
##  3  1870 peanuts 203. 
##  4  1870 coffee   88.1
##  5  1870 cocoa    78.8
##  6  1870 wheat    88.1
##  7  1870 rye     103. 
##  8  1870 rice     83.5
##  9  1870 corn    121. 
## 10  1870 barley  103. 
## # … with 1,888 more rows

Already this looks close to what a long form of the food prices data set should look like. One improvement that we can make is to set better column names, which can be done by setting the options names_to and values_to in the function call:

food_prices %>%
  pivot_longer(-c(year), names_to = "food", values_to = "price")
## # A tibble: 1,898 x 3
##     year food    price
##    <dbl> <chr>   <dbl>
##  1  1870 tea     129. 
##  2  1870 sugar   151. 
##  3  1870 peanuts 203. 
##  4  1870 coffee   88.1
##  5  1870 cocoa    78.8
##  6  1870 wheat    88.1
##  7  1870 rye     103. 
##  8  1870 rice     83.5
##  9  1870 corn    121. 
## 10  1870 barley  103. 
## # … with 1,888 more rows

The longer form of the data set makes it much easier to do some kinds of analysis. For example, we can draw a line chart of all of the food prices with a single graphics layer:

food_prices %>%
  pivot_longer(-year, names_to = "food", values_to = "price") %>%
  ggplot() +
    geom_line(aes(x = year, y = price, color = food))