In these notes we are going to cover a set of functions that take a data frame as an input and return a new version of the data frame. These functions are called verbs and come from the dplyr package. If you are familiar with running database queries, note that all of these verbs map onto SQL commands. In fact, R can be set up so that dplyr is called over a database rather than a local data frame in memory.
There are over 40 verbs in the dplyr package, though most are a minor variant or specific application of another verb. In this notebook we will see only four of them:
filter
and slice
)select
)arrange
)In all verb functions, the first argument is the original data frame
and the output is a new data frame. Here, we will also see the functions
between
and %in%
to assist with the filtering
command and desc
to assist with arranging the rows of a
data set.
Note that verbs do not modify the original data; they operate on a copy of the original data. We have to make an explicit name for the new data set if we want to save it for use elsewhere.
It is often useful to take a subset of the rows of an existing data
set, for example if you want to build a model on a certain subpopulation
or highlight a particular part of the data in a plot. Perhaps the most
straightforward way to take a subset of rows is to indicate the specific
row numbers that we want to extract. In order to select rows by row
numbers, we use the verb slice
, followed by the numbers of
the rows we want separated by commas. Here is an example taking the
second, fifth, and seventh rows of the data:
%>%
food slice(2, 5, 7)
## # A tibble: 3 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asparagus vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 2 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 3 Beef meat 288 19.5 7.73 87 384 0 0 0
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
As mentioned above, the code here does not change the data set
food
itself. It still has all 61 rows of food contained in
it. If we want to create a new data set with just these three food item,
we need to explicitly name and assign it. For example, here is how we
would create a data set of the first five food items named
food_first_five
:
<- food %>%
food_first_five slice(1, 2, 3, 4, 5)
There is a convenient a shorthand for selecting a range of row numbers, for example every row from the tenth to the twentieth, by indicating the starting and ending row number by a colon. Here, for example, is another way to select the first five rows of the data set:
%>%
food slice(1:5)
## # A tibble: 5 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apple fruit 52 0.1 0.028 0 1 13.8 2.4 10.4
## 2 Asparagus vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 3 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## 4 Banana fruit 89 0.3 0.112 0 1 22.8 2.6 12.2
## 5 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
Another way to take a subset of our data is to select rows based on
conditions about the features in the data set. To do this we use the
filter
function, which accepts a statement about features
in the data set. Only rows where the statements are true will be
returned. For example, here is how we use the filter
command to select the foods that have more than 150 calories grams of
sugar in each serving:
%>%
food filter(calories > 150)
## # A tibble: 20 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## 2 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 3 Beef meat 288 19.5 7.73 87 384 0 0 0
## 4 Catfish fish 240 14.5 3.25 69 398 8.54 0.5 0.85
## 5 Cheese dairy 350 26.9 16.6 83 955 4.71 0 3.54
## 6 Chicken meat 237 13.4 3.76 87 404 0 0 0
## 7 Clam fish 180 8 1.60 56 400 11.1 0.5 0.85
## 8 Cod fish 211 10.8 2.22 57 401 8.25 0.5 0.82
## 9 Halibut fish 239 17.7 3.10 59 103 0 0 0
## 10 Lamb meat 292 20.7 8.76 96 394 0 0 0
## 11 Oat grains 389 6 1.22 0 2 66.3 10.6 0
## 12 Oyster fish 160 7.9 1.85 57 595 12.5 0.5 0.71
## 13 Penne grains 157 0.9 0.175 0 233 30.7 1.8 0.56
## 14 Pork meat 271 17 6.17 90 384 0 0 0
## 15 Salmon fish 171 7.5 1.31 62 467 0.49 0 0.11
## 16 Scallop fish 217 10.9 2.22 54 487 10.5 0.5 0.82
## 17 Sour Cr… dairy 214 20.9 13.0 44 53 4.27 0 0.16
## 18 Swordfi… fish 177 8.2 1.96 47 494 0.49 0 0.11
## 19 Tuna fish 153 3.9 0.811 53 366 0.41 0 0.09
## 20 Turkey meat 187 7 2.00 77 69 0 0 0
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
The output data set has only 20 rows, compared to the 62 in the
original data. Other comparisons can be done with <
,
>=
and <=
. There is also a special
function called between
that is often useful. For example,
here are the rows that have between 2 and 3 grams of total fat:
%>%
food filter(between(total_fat, 2, 3))
## # A tibble: 4 × 17
## item food_g…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 2 Quinoa grains 143 2.2 0.226 0 196 26.4 2.3 0
## 3 Shrimp fish 144 2.3 0.446 206 613 1.24 0 0
## 4 Potato vegetab… 104 2 0.458 0 254 19.4 1.7 0.82
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
If you want to filter on a categorical feature, you can use the
%in%
operator to select specific categories. Here is the
code to filter only the fish and vegetable food groups:
%>%
food filter(food_group %in% c("fish", "vegetable"))
## # A tibble: 30 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asparag… vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 2 String … vegeta… 31 0.1 0.026 0 6 7.13 3.4 1.4
## 3 Bell Pe… vegeta… 26 0 0.059 0 2 6.03 2 4.2
## 4 Crab fish 87 1 0.222 78 293 0.04 0 0
## 5 Broccoli vegeta… 34 0.3 0.039 0 33 6.64 2.6 1.7
## 6 Cabbage vegeta… 24 0.1 0.016 0 18 5.58 2.3 3.58
## 7 Carrot vegeta… 41 0.2 0.037 0 69 9.58 2.8 4.54
## 8 Catfish fish 240 14.5 3.25 69 398 8.54 0.5 0.85
## 9 Caulifl… vegeta… 25 0 0.032 0 30 5.3 2.5 2.4
## 10 Celery vegeta… 14 0.1 0.043 0 80 2.97 1.6 1.83
## # … with 20 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, and abbreviated variable names ¹food_group, ²calories,
## # ³total_fat, ⁴cholesterol
As with the other verbs, we can chain together multiple calls to produce more complex logic. For example, this code selects fruits that have more than 150 calories per serving:
%>%
food filter(calories > 150) %>%
filter(food_group %in% c("fruit"))
## # A tibble: 1 × 17
## item food_gr…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
Which results in a reduced data set with only 1 row (avocados). You
can also use ==
to test equality
(food_group == "fruit"
) or !=
to test whether
a feature is not equal to a specific value.
It is also possible to create a chain of calls that then get piped
into a call to the ggplot
function. For example, here is a
plot of the fruits and vegetables with the Avocado outlier removed (by
limiting the maximum available total fat).
%>%
food filter(food_group %in% c("vegetable", "fruit")) %>%
filter(total_fat < 10) %>%
ggplot() +
geom_point(aes(x = calories, y = total_fat, color = food_group)) +
geom_text_repel(aes(x = calories, y = total_fat, label = item)) +
scale_color_viridis_d()
The pattern of a starting with a data set, applying a number of transformations, and then creating a visualization of the data will become a common pattern in our analyses.
Now that we know how to create a subset of our data, let’s use this new knowledge to build some interesting data visualizations. To start, create a data set that just consists of the food types that are in the meat food group:
<- filter(food, food_group %in% c("meat"))
food_meat food_meat
## # A tibble: 6 × 17
## item food_gr…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Beef meat 288 19.5 7.73 87 384 0 0 0
## 2 Chicken meat 237 13.4 3.76 87 404 0 0 0
## 3 Duck meat 132 5.9 2.32 77 74 0 0 0
## 4 Lamb meat 292 20.7 8.76 96 394 0 0 0
## 5 Pork meat 271 17 6.17 90 384 0 0 0
## 6 Turkey meat 187 7 2.00 77 69 0 0 0
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
One of the core ideas behind the Grammar of Graphics is that complex
visualizations can be constructed by layering relatively simply elements
on top of one another. What if we wanted to put together two layers
where one layer uses the food
data set and the other uses
food_meat
? To do this, we can override the default
data set in a layer with the option data =
. This will use a
different data set within a particular layer. For example, here is how
we can layer the meat data set on top of the rest of the food items.
%>%
food ggplot() +
geom_point(aes(x = calories, y = total_fat)) +
geom_point(aes(x = calories, y = total_fat), data = food_meat)
This plot, however, does not look any different than it would if we were just to plot all of the food together. The second layer of points just sits unassumingly on top of the rest of the data. To rectify this, we can color each layer a different color in order to distinguish them from one another. Let’s try to highlight the meat food group in a navy blue, while making the rest of the points a light grey:
%>%
food ggplot() +
geom_point(aes(x = calories, y = total_fat), color = "grey85") +
geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat)
We now have a plot that shows exactly where the meats are relative to the other food items. We can further build up the plot by showing the names of just these rows of the dataset as well:
%>%
food ggplot() +
geom_point(aes(x = calories, y = total_fat), color = "grey85") +
geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat) +
geom_text_repel(
aes(x = calories, y = total_fat, label = item),
color = "navy",
data = food_meat
)
Notice that the code is starting to get a bit more complicated and some of the graphic layers are becoming a bit long. This is a good place to use the shorthand notation to inherit aesthetics across layers, like this:
%>%
food ggplot(aes(calories, total_fat)) +
geom_point(color = "grey85") +
geom_point(color = "navy", data = food_meat) +
geom_text_repel(aes(label = item), color = "navy", data = food_meat)
Notice how a relatively small set of commands can be put together in different ways to build a variety of plots. Already, we are making further progress towards building informative and beautiful graphics in R!
It is also possible to take a subset of the columns in a data set. To
do this, we make use of the verb select
. We pass it the
names of the features we want to keep in the output data set, in the
(possibly new) order that we want the columns to be arranged in. Here,
for example, is a new version of the foods data set containing only the
food item name followed by the amount of Vitamin A and Vitamin C:
%>%
food select(item, vitamin_a, vitamin_c)
## # A tibble: 61 × 3
## item vitamin_a vitamin_c
## <chr> <dbl> <dbl>
## 1 Apple 1 8
## 2 Asparagus 15 9
## 3 Avocado 3 17
## 4 Banana 1 15
## 5 Chickpea 0 3
## 6 String Bean 14 27
## 7 Beef 0 0
## 8 Bell Pepper 63 317
## 9 Crab 0 5
## 10 Broccoli 12 149
## # … with 51 more rows
We will not need to use the select
verb as often as
filter
because for the most part having extra features
around does not effect data visualizations or data models. However, it
can be useful to displaying results. As we saw above, the Vitamin A and
Vitamin C columns were cut-off in the original output but are not
visible in the selected data set version. Removing and reordering
unneeded columns will also be useful in some of the advanced
applications that are discussed in the following classes.
The verbs slice
and filter
determine a
subset of rows to keep from the original data set. The
arrange
verb, in contrast, keeps all of the original data
but re-orders its rows. Specifically, we give it one or more feature
names and it sorts the data by the first feature from smallest to
largest (or alphabetically for character feature). In the case of ties,
the second feature is used if given. More features can be given to
further break additional ties. Here is an example where we order the
data set first by food_group
and then by
calories
:
%>%
food arrange(food_group, calories)
## # A tibble: 61 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Milk dairy 60 3.2 1.86 10 40 4.52 0 5.26
## 2 Yogurt dairy 99 1.1 0.742 5 53 18.6 0 18.6
## 3 Sour Cr… dairy 214 20.9 13.0 44 53 4.27 0 0.16
## 4 Cheese dairy 350 26.9 16.6 83 955 4.71 0 3.54
## 5 Crab fish 87 1 0.222 78 293 0.04 0 0
## 6 Lobster fish 97 0.5 0.106 71 700 1.27 0 0
## 7 Haddock fish 128 3.7 0.69 66 402 0.48 0.1 0.11
## 8 Flounder fish 133 4.2 0.867 56 417 0.41 0 0.09
## 9 Shrimp fish 144 2.3 0.446 206 613 1.24 0 0
## 10 Tuna fish 153 3.9 0.811 53 366 0.41 0 0.09
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, and abbreviated variable names ¹food_group, ²calories,
## # ³total_fat, ⁴cholesterol
In the new data set all of the dairy products come up first followed by the fish products. Within each group, the items are sorted from the lowest to highest number of calories.
The ordering can be reversed (i.e., from the highest to the lowest
value) be wrapping a feature name in the function desc()
,
such as this ordering from the most saturated fat to the least:
%>%
food arrange(desc(sat_fat))
## # A tibble: 61 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cheese dairy 350 26.9 16.6 83 955 4.71 0 3.54
## 2 Sour Cr… dairy 214 20.9 13.0 44 53 4.27 0 0.16
## 3 Lamb meat 292 20.7 8.76 96 394 0 0 0
## 4 Beef meat 288 19.5 7.73 87 384 0 0 0
## 5 Pork meat 271 17 6.17 90 384 0 0 0
## 6 Chicken meat 237 13.4 3.76 87 404 0 0 0
## 7 Catfish fish 240 14.5 3.25 69 398 8.54 0.5 0.85
## 8 Halibut fish 239 17.7 3.10 59 103 0 0 0
## 9 Duck meat 132 5.9 2.32 77 74 0 0 0
## 10 Cod fish 211 10.8 2.22 57 401 8.25 0.5 0.82
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, and abbreviated variable names ¹food_group, ²calories,
## # ³total_fat, ⁴cholesterol
In the result here, “Cheese” has been placed at the top of the data set, followed by “Sour Cream” and “Lamb”.
One particularly useful application of arrange
is to
pair it with the verb slice
. Here, for example, is the code
to select the six foods in our data set that have the highest amount of
Vitamin A:
%>%
food arrange(desc(vitamin_a)) %>%
slice(1:6)
## # A tibble: 6 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Carrot vegeta… 41 0.2 0.037 0 69 9.58 2.8 4.54
## 2 Sweet Po… vegeta… 86 0 0.018 0 55 20.1 3 4.18
## 3 Cantalou… fruit 34 0.1 0.051 0 16 8.16 0.9 7.86
## 4 Bell Pep… vegeta… 26 0 0.059 0 2 6.03 2 4.2
## 5 Grapefru… fruit 32 0 0.014 0 0 8.08 1.1 6.98
## 6 Tomato vegeta… 18 0 0.046 0 5 3.92 1.2 2.63
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## # vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## # abbreviated variable names ¹food_group, ²calories, ³total_fat,
## # ⁴cholesterol
By saving this data set, we could highlight these specific foods on top of a plot comparing them to the remainder of the data.
Consider the following subset of the hans
dataset we
worked with in the previous notebook:
## # A tibble: 7 × 4
## country continent life_exp gdp
## <chr> <chr> <dbl> <dbl>
## 1 Canada Americas 80.7 36319.
## 2 France Europe 80.7 30470.
## 3 Ireland Europe 78.9 40676.
## 4 Netherlands Europe 79.8 36798.
## 5 Portugal Europe 78.1 20510.
## 6 Spain Europe 80.9 28821.
## 7 United States Americas 78.2 42952.
For each of the following five questions, write the R code that would produce the desired data set:
country
and
gdp
.