In this notebook we introduce another set of methods for manipulating data sets. Table pivots, which are related by not identical to the spreadsheet concept of a pivot table, are a way or rearranging the values in a table without adding or lossing any additional information. This is achieved by either making the table longer (more rows, fewer columns) or wider (more columns, fewer rows).
What sort of situations would require going between two different
formats with a different number of rows? As an example, consider a
hypothetical experiment where we measure the heights of 100 plants every
morning for two weeks. There are two different ways to store this data.
We could have 100 rows, one for each plant, with features
height_day1
, height_day2
, and so on all the
way through height_day14
. Alternatively, we could have 1400
rows with just three columns: an id for the plant, a feature for the
day, and a feature for height. Notice that both of these options capture
the same information, but each privileges a particular kind of
analysis.
In the wider table format, it is straightforward to compute the amount that each plot grew over the two weeks using a single mutate function. In the longer table format, it would be straightforward to filter by a specific plant id and draw a line plot showing the growth of a specific plant over the two week period. Both drawing a plot with the wider table or computing the growth with the longer table are possible, but require a surprising amount of work and code.
In this notebook we will introduce two new functions for alternating between wider and longer formats for a data set. These are principles that will be fundamental to several applications, particularly with text and temporal data sets.
For these notes, we will make use of the food_prices
data set. As shown in our motivation example, pivoting is often a useful
operation to apply when analyzing data collected over time. The data set
is organized with year as the observation and each food type as a
column.
food_prices
## # A tibble: 146 × 14
## year tea sugar peanuts coffee cocoa wheat rye rice corn barley
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1870 129. 151. 203. 88.1 78.8 88.1 103. 83.5 121. 103.
## 2 1871 132. 167. 222. 109. 66.7 118. 105. 84.5 88.4 130.
## 3 1872 134. 162. 189. 140. 71.6 122. 102. 92.9 69.2 125.
## 4 1873 136. 154. 179. 173. 65.8 116. 106. 91.0 67.1 166.
## 5 1874 146. 153. 231. 187. 69.9 113. 126. 99.6 128. 174.
## 6 1875 149. 150. 197. 176. 69.4 110. 116. 85.8 127. 161.
## 7 1876 150. 160. 172. 184. 80.7 114. 106. 95.3 91.2 132.
## 8 1877 149. 189. 153. 198. 87.8 144. 97.0 108. 94.5 125.
## 9 1878 150. 165. 160. 169. 96.0 115. 91.6 114. 82.2 121.
## 10 1879 144. 158. 133. 149. 108. 118. 113. 110. 78.7 124.
## # … with 136 more rows, and 3 more variables: pork <dbl>, beef <dbl>,
## # lamb <dbl>
This format makes it straightforward to compute the correlation between the prices of different kinds of food items. A longer format for the data set would, instead, have one row for each combination of year and food time.
In order to make this table longer, we will apply the
pivot_longer
function. This function requires knowing which
current features in the data set should be turned into values in the
output data set. Often, it is easier to describe the set of values that
will not be turned into values. Here, we indicate (with a minus
sign) that the year value should remain as a feature in the output data
set:
%>%
food_prices pivot_longer(-c(year))
## # A tibble: 1,898 × 3
## year name value
## <dbl> <chr> <dbl>
## 1 1870 tea 129.
## 2 1870 sugar 151.
## 3 1870 peanuts 203.
## 4 1870 coffee 88.1
## 5 1870 cocoa 78.8
## 6 1870 wheat 88.1
## 7 1870 rye 103.
## 8 1870 rice 83.5
## 9 1870 corn 121.
## 10 1870 barley 103.
## # … with 1,888 more rows
Already this looks close to what a long form of the food prices data
set should look like. One improvement that we can make is to set better
column names, which can be done by setting the options
names_to
and values_to
in the function
call:
%>%
food_prices pivot_longer(-c(year), names_to = "food", values_to = "price")
## # A tibble: 1,898 × 3
## year food price
## <dbl> <chr> <dbl>
## 1 1870 tea 129.
## 2 1870 sugar 151.
## 3 1870 peanuts 203.
## 4 1870 coffee 88.1
## 5 1870 cocoa 78.8
## 6 1870 wheat 88.1
## 7 1870 rye 103.
## 8 1870 rice 83.5
## 9 1870 corn 121.
## 10 1870 barley 103.
## # … with 1,888 more rows
The longer form of the data set makes it much easier to do some kinds of analysis. For example, we can draw a line chart of all of the food prices with a single graphics layer:
%>%
food_prices pivot_longer(-year, names_to = "food", values_to = "price") %>%
ggplot() +
geom_line(aes(x = year, y = price, color = food))
Drawing this plot with the original data set would require manually including a layer for each food type, selecting their colors, and building a manual legend. The alternative using the longer table is certainly the preferred approach.
To illustrate making a table wider, let’s create a new data set consisting of the long format of the food prices data set from just the years 1950 and 1975:
<- food_prices %>%
food_prices_long pivot_longer(-year, names_to = "food", values_to = "price") %>%
filter(year %in% c(1950, 1975))
As described in our motivating example, it makes sense for some
analyses to make each time point a column in a wider data set. To do
this, we use the pivot_wider
function. We need to indicate
the current feature contains the values that will become new columns and
the feature from which to take the values for the new column from. Here,
the names will come from the years
column (we want a new
column for 1950 and another one for 1975) and the values will be filled
in with prices.
%>%
food_prices_long pivot_wider(names_from = year, values_from = price)
## # A tibble: 13 × 3
## food `1950` `1975`
## <chr> <dbl> <dbl>
## 1 tea 68.9 54.1
## 2 sugar 59.5 114.
## 3 peanuts 119. 95.6
## 4 coffee 210. 105.
## 5 cocoa 64.6 54.1
## 6 wheat 106. 74.8
## 7 rye 88.4 71.2
## 8 rice 62.7 74.2
## 9 corn 135. 113.
## 10 barley 98.8 91.3
## 11 pork 52.3 88.8
## 12 beef 62.4 268.
## 13 lamb 33.8 186.
One problem with the default output is that the column names now
start with a number, which is not an allowed variable name in R. This
makes it awkward to work with the data set; it is better to add a prefix
to the names to make them valid. This can be done by setting the
names_prefix
option in the pivot_wider
function.
%>%
food_prices_long pivot_wider(
names_from = year, values_from = price, names_prefix = "year_"
)
## # A tibble: 13 × 3
## food year_1950 year_1975
## <chr> <dbl> <dbl>
## 1 tea 68.9 54.1
## 2 sugar 59.5 114.
## 3 peanuts 119. 95.6
## 4 coffee 210. 105.
## 5 cocoa 64.6 54.1
## 6 wheat 106. 74.8
## 7 rye 88.4 71.2
## 8 rice 62.7 74.2
## 9 corn 135. 113.
## 10 barley 98.8 91.3
## 11 pork 52.3 88.8
## 12 beef 62.4 268.
## 13 lamb 33.8 186.
This new form of the data set makes it straightforward to plot the price of each food type in 1975 as a function of its price in 1950.
%>%
food_prices_long pivot_wider(names_from = year, values_from = price, names_prefix = "year_") %>%
ggplot() +
geom_point(aes(x = year_1950, y = year_1975)) +
geom_text_repel(aes(x = year_1950, y = year_1975, label = food))
We can add some of the polishing touches mentioned in Notebook07 to make the plot even more readable.
%>%
food_prices_long pivot_wider(names_from = year, values_from = price, names_prefix = "year_") %>%
mutate(food = stri_trans_totitle(food)) %>%
ggplot() +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
geom_point(aes(x = year_1950, y = year_1975)) +
geom_text_repel(aes(x = year_1950, y = year_1975, label = food)) +
labs(x = "Price Index (1950)", y = "Price Index (1975)")
In this new plot, can you see what products got much more expensive, much less expensive, and stayed about the same from 1950 to 1975?
The syntax for making tables wider or longer is, on the surface, not much more complex than other table verbs that we have covered in this text. The biggest challenges with table pivots are identifying when they will simplify an analysis and not over-using them. The best way to avoid these issues is to store your data in the longest format that makes sense for your data. For example, in the motivating example about plant growth, it is better to store the data with 1400 rows and 3 columns.
Storing data in a longer format has a number of benefits. Reducing the number of columns makes it easier to document the (smaller set of) features with a well-written data dictionary. Also, while avoided in our simple examples within this notebook, pivoting wider also often requires less code and results in fewer bugs. Several of these are illustrated in the practice exercises.
Perhaps the biggest benefit of storing data in a longer format is to avoid the potentially complex chain of operations required to make the plot at the end of the previous section. The original data set is stored with years as rows and items as columns. Producing the plot requires thinking of years and columns and items as rows; this needed us to first pivot longer and then pivot wider. Keeping data in a longer format avoids the need for double pivots, while also making the different kinds of analysis (item and year, year by item, item by year) all reasonable accessible.
In my experience, some social scientists have a bad habit of storing data in extremely wide and unwieldy formats. It seems to be something that comes from the way that feels natural to organize survey and time-series data sets. For example, the U.S. Census Bureau produces data sets that often have hundreds of columns. Keep this in mind as you collect your own data, or work with external sources. Often an analysis that looks difficult at first will appear quite straightforward after pivoting your data set to a longer format.
Consider the following subset of the hans
dataset:
## # A tibble: 8 × 3
## country year gdp
## <chr> <dbl> <dbl>
## 1 China 1957 576.
## 2 China 2007 4959.
## 3 France 1957 8663.
## 4 France 2007 30470.
## 5 Mexico 1957 4132.
## 6 Mexico 2007 11978.
## 7 United States 1957 14847.
## 8 United States 2007 42952.
Answer the following four questions:
## # A tibble: 2 × 5
## year China France Mexico `United States`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1957 576. 8663. 4132. 14847.
## 2 2007 4959. 30470. 11978. 42952.
## # A tibble: 4 × 3
## country year_1957 year_2007
## <chr> <dbl> <dbl>
## 1 China 576. 4959.
## 2 France 8663. 30470.
## 3 Mexico 4132. 11978.
## 4 United States 14847. 42952.