Verbs

In these notes we are going to cover a set of functions that take a data frame as an input and return a new version of the data frame. These functions are called verbs and come from the dplyr package. If you are familiar with running database queries, note that all of these verbs map onto SQL commands. In fact, R can be set up so that dplyr is called over a database rather than a local data frame in memory.

There are over 40 verbs in the dplyr package, though most are a minor variant or specific application of another verb. In this notebook we will see only four of them:

In all verb functions, the first argument is the original data frame and the output is a new data frame. Here, we will also see the functions between and %in% to assist with the filtering command and desc to assist with arranging the rows of a data set.

Note that verbs do not modify the original data; they operate on a copy of the original data. We have to make an explicit name for the new data set if we want to save it for use elsewhere.

Choosing rows

It is often useful to take a subset of the rows of an existing data set, for example if you want to build a model on a certain subpopulation or highlight a particular part of the data in a plot. Perhaps the most straightforward way to take a subset of rows is to indicate the specific row numbers that we want to extract. In order to select rows by row numbers, we use the verb slice, followed by the numbers of the rows we want separated by commas. Here is an example taking the second, fifth, and seventh rows of the data:

food %>%
  slice(2, 5, 7)
## # A tibble: 3 × 17
##   item      food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>     <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Asparagus vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
## 2 Chickpea  grains      180     2.9   0.309       0    243 30.0    8.6  5.29
## 3 Beef      meat        288    19.5   7.73       87    384  0      0    0   
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

As mentioned above, the code here does not change the data set food itself. It still has all 61 rows of food contained in it. If we want to create a new data set with just these three food item, we need to explicitly name and assign it. For example, here is how we would create a data set of the first five food items named food_first_five:

food_first_five <- food %>%
  slice(1, 2, 3, 4, 5)

There is a convenient a shorthand for selecting a range of row numbers, for example every row from the tenth to the twentieth, by indicating the starting and ending row number by a colon. Here, for example, is another way to select the first five rows of the data set:

food %>%
  slice(1:5)
## # A tibble: 5 × 17
##   item      food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>     <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Apple     fruit        52     0.1   0.028       0      1 13.8    2.4 10.4 
## 2 Asparagus vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
## 3 Avocado   fruit       160    14.6   2.13        0      7  8.53   6.7  0.66
## 4 Banana    fruit        89     0.3   0.112       0      1 22.8    2.6 12.2 
## 5 Chickpea  grains      180     2.9   0.309       0    243 30.0    8.6  5.29
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

Another way to take a subset of our data is to select rows based on conditions about the features in the data set. To do this we use the filter function, which accepts a statement about features in the data set. Only rows where the statements are true will be returned. For example, here is how we use the filter command to select the foods that have more than 150 calories grams of sugar in each serving:

food %>%
  filter(calories > 150)
## # A tibble: 20 × 17
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Avocado  fruit       160    14.6   2.13        0      7  8.53   6.7  0.66
##  2 Chickpea grains      180     2.9   0.309       0    243 30.0    8.6  5.29
##  3 Beef     meat        288    19.5   7.73       87    384  0      0    0   
##  4 Catfish  fish        240    14.5   3.25       69    398  8.54   0.5  0.85
##  5 Cheese   dairy       350    26.9  16.6        83    955  4.71   0    3.54
##  6 Chicken  meat        237    13.4   3.76       87    404  0      0    0   
##  7 Clam     fish        180     8     1.60       56    400 11.1    0.5  0.85
##  8 Cod      fish        211    10.8   2.22       57    401  8.25   0.5  0.82
##  9 Halibut  fish        239    17.7   3.10       59    103  0      0    0   
## 10 Lamb     meat        292    20.7   8.76       96    394  0      0    0   
## 11 Oat      grains      389     6     1.22        0      2 66.3   10.6  0   
## 12 Oyster   fish        160     7.9   1.85       57    595 12.5    0.5  0.71
## 13 Penne    grains      157     0.9   0.175       0    233 30.7    1.8  0.56
## 14 Pork     meat        271    17     6.17       90    384  0      0    0   
## 15 Salmon   fish        171     7.5   1.31       62    467  0.49   0    0.11
## 16 Scallop  fish        217    10.9   2.22       54    487 10.5    0.5  0.82
## 17 Sour Cr… dairy       214    20.9  13.0        44     53  4.27   0    0.16
## 18 Swordfi… fish        177     8.2   1.96       47    494  0.49   0    0.11
## 19 Tuna     fish        153     3.9   0.811      53    366  0.41   0    0.09
## 20 Turkey   meat        187     7     2.00       77     69  0      0    0   
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

The output data set has only 20 rows, compared to the 62 in the original data. Other comparisons can be done with <, >= and <=. There is also a special function called between that is often useful. For example, here are the rows that have between 2 and 3 grams of total fat:

food %>%
  filter(between(total_fat, 2, 3))
## # A tibble: 4 × 17
##   item     food_g…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>    <chr>      <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Chickpea grains       180     2.9   0.309       0    243 30.0    8.6  5.29
## 2 Quinoa   grains       143     2.2   0.226       0    196 26.4    2.3  0   
## 3 Shrimp   fish         144     2.3   0.446     206    613  1.24   0    0   
## 4 Potato   vegetab…     104     2     0.458       0    254 19.4    1.7  0.82
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

If you want to filter on a categorical feature, you can use the %in% operator to select specific categories. Here is the code to filter only the fish and vegetable food groups:

food %>%
  filter(food_group %in% c("fish", "vegetable"))
## # A tibble: 30 × 17
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Asparag… vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
##  2 String … vegeta…      31     0.1   0.026       0      6  7.13   3.4  1.4 
##  3 Bell Pe… vegeta…      26     0     0.059       0      2  6.03   2    4.2 
##  4 Crab     fish         87     1     0.222      78    293  0.04   0    0   
##  5 Broccoli vegeta…      34     0.3   0.039       0     33  6.64   2.6  1.7 
##  6 Cabbage  vegeta…      24     0.1   0.016       0     18  5.58   2.3  3.58
##  7 Carrot   vegeta…      41     0.2   0.037       0     69  9.58   2.8  4.54
##  8 Catfish  fish        240    14.5   3.25       69    398  8.54   0.5  0.85
##  9 Caulifl… vegeta…      25     0     0.032       0     30  5.3    2.5  2.4 
## 10 Celery   vegeta…      14     0.1   0.043       0     80  2.97   1.6  1.83
## # … with 20 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, and abbreviated variable names ¹​food_group, ²​calories,
## #   ³​total_fat, ⁴​cholesterol

As with the other verbs, we can chain together multiple calls to produce more complex logic. For example, this code selects fruits that have more than 150 calories per serving:

food %>%
  filter(calories > 150) %>%
  filter(food_group %in% c("fruit"))
## # A tibble: 1 × 17
##   item    food_gr…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>   <chr>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Avocado fruit         160    14.6    2.13       0      7  8.53   6.7  0.66
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

Which results in a reduced data set with only 1 row (avocados). You can also use == to test equality (food_group == "fruit") or != to test whether a feature is not equal to a specific value.

It is also possible to create a chain of calls that then get piped into a call to the ggplot function. For example, here is a plot of the fruits and vegetables with the Avocado outlier removed (by limiting the maximum available total fat).

food %>%
  filter(food_group %in% c("vegetable", "fruit")) %>%
  filter(total_fat < 10) %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat, color = food_group)) +
    geom_text_repel(aes(x = calories, y = total_fat, label = item)) +
    scale_color_viridis_d()

The pattern of a starting with a data set, applying a number of transformations, and then creating a visualization of the data will become a common pattern in our analyses.

Data and Layers

Now that we know how to create a subset of our data, let’s use this new knowledge to build some interesting data visualizations. To start, create a data set that just consists of the food types that are in the meat food group:

food_meat <- filter(food, food_group %in% c("meat"))
food_meat
## # A tibble: 6 × 17
##   item    food_gr…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>   <chr>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Beef    meat          288    19.5    7.73      87    384     0     0     0
## 2 Chicken meat          237    13.4    3.76      87    404     0     0     0
## 3 Duck    meat          132     5.9    2.32      77     74     0     0     0
## 4 Lamb    meat          292    20.7    8.76      96    394     0     0     0
## 5 Pork    meat          271    17      6.17      90    384     0     0     0
## 6 Turkey  meat          187     7      2.00      77     69     0     0     0
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

One of the core ideas behind the Grammar of Graphics is that complex visualizations can be constructed by layering relatively simply elements on top of one another. What if we wanted to put together two layers where one layer uses the food data set and the other uses food_meat? To do this, we can override the default data set in a layer with the option data =. This will use a different data set within a particular layer. For example, here is how we can layer the meat data set on top of the rest of the food items.

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat)) +
    geom_point(aes(x = calories, y = total_fat), data = food_meat)

This plot, however, does not look any different than it would if we were just to plot all of the food together. The second layer of points just sits unassumingly on top of the rest of the data. To rectify this, we can color each layer a different color in order to distinguish them from one another. Let’s try to highlight the meat food group in a navy blue, while making the rest of the points a light grey:

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat), color = "grey85") +
    geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat)

We now have a plot that shows exactly where the meats are relative to the other food items. We can further build up the plot by showing the names of just these rows of the dataset as well:

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat), color = "grey85") +
    geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat) +
    geom_text_repel(
      aes(x = calories, y = total_fat, label = item),
      color = "navy",
      data = food_meat
    )

Notice that the code is starting to get a bit more complicated and some of the graphic layers are becoming a bit long. This is a good place to use the shorthand notation to inherit aesthetics across layers, like this:

food %>%
  ggplot(aes(calories, total_fat)) +
    geom_point(color = "grey85") +
    geom_point(color = "navy", data = food_meat) +
    geom_text_repel(aes(label = item), color = "navy", data = food_meat)

Notice how a relatively small set of commands can be put together in different ways to build a variety of plots. Already, we are making further progress towards building informative and beautiful graphics in R!

Selecting Columns

It is also possible to take a subset of the columns in a data set. To do this, we make use of the verb select. We pass it the names of the features we want to keep in the output data set, in the (possibly new) order that we want the columns to be arranged in. Here, for example, is a new version of the foods data set containing only the food item name followed by the amount of Vitamin A and Vitamin C:

food %>%
  select(item, vitamin_a, vitamin_c)
## # A tibble: 61 × 3
##    item        vitamin_a vitamin_c
##    <chr>           <dbl>     <dbl>
##  1 Apple               1         8
##  2 Asparagus          15         9
##  3 Avocado             3        17
##  4 Banana              1        15
##  5 Chickpea            0         3
##  6 String Bean        14        27
##  7 Beef                0         0
##  8 Bell Pepper        63       317
##  9 Crab                0         5
## 10 Broccoli           12       149
## # … with 51 more rows

We will not need to use the select verb as often as filter because for the most part having extra features around does not effect data visualizations or data models. However, it can be useful to displaying results. As we saw above, the Vitamin A and Vitamin C columns were cut-off in the original output but are not visible in the selected data set version. Removing and reordering unneeded columns will also be useful in some of the advanced applications that are discussed in the following classes.

Arranging Rows

The verbs slice and filter determine a subset of rows to keep from the original data set. The arrange verb, in contrast, keeps all of the original data but re-orders its rows. Specifically, we give it one or more feature names and it sorts the data by the first feature from smallest to largest (or alphabetically for character feature). In the case of ties, the second feature is used if given. More features can be given to further break additional ties. Here is an example where we order the data set first by food_group and then by calories:

food %>%
  arrange(food_group, calories)
## # A tibble: 61 × 17
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Milk     dairy        60     3.2   1.86       10     40  4.52   0    5.26
##  2 Yogurt   dairy        99     1.1   0.742       5     53 18.6    0   18.6 
##  3 Sour Cr… dairy       214    20.9  13.0        44     53  4.27   0    0.16
##  4 Cheese   dairy       350    26.9  16.6        83    955  4.71   0    3.54
##  5 Crab     fish         87     1     0.222      78    293  0.04   0    0   
##  6 Lobster  fish         97     0.5   0.106      71    700  1.27   0    0   
##  7 Haddock  fish        128     3.7   0.69       66    402  0.48   0.1  0.11
##  8 Flounder fish        133     4.2   0.867      56    417  0.41   0    0.09
##  9 Shrimp   fish        144     2.3   0.446     206    613  1.24   0    0   
## 10 Tuna     fish        153     3.9   0.811      53    366  0.41   0    0.09
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, and abbreviated variable names ¹​food_group, ²​calories,
## #   ³​total_fat, ⁴​cholesterol

In the new data set all of the dairy products come up first followed by the fish products. Within each group, the items are sorted from the lowest to highest number of calories.

The ordering can be reversed (i.e., from the highest to the lowest value) be wrapping a feature name in the function desc(), such as this ordering from the most saturated fat to the least:

food %>%
  arrange(desc(sat_fat))
## # A tibble: 61 × 17
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Cheese   dairy       350    26.9   16.6       83    955  4.71   0    3.54
##  2 Sour Cr… dairy       214    20.9   13.0       44     53  4.27   0    0.16
##  3 Lamb     meat        292    20.7    8.76      96    394  0      0    0   
##  4 Beef     meat        288    19.5    7.73      87    384  0      0    0   
##  5 Pork     meat        271    17      6.17      90    384  0      0    0   
##  6 Chicken  meat        237    13.4    3.76      87    404  0      0    0   
##  7 Catfish  fish        240    14.5    3.25      69    398  8.54   0.5  0.85
##  8 Halibut  fish        239    17.7    3.10      59    103  0      0    0   
##  9 Duck     meat        132     5.9    2.32      77     74  0      0    0   
## 10 Cod      fish        211    10.8    2.22      57    401  8.25   0.5  0.82
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, and abbreviated variable names ¹​food_group, ²​calories,
## #   ³​total_fat, ⁴​cholesterol

In the result here, “Cheese” has been placed at the top of the data set, followed by “Sour Cream” and “Lamb”.

One particularly useful application of arrange is to pair it with the verb slice. Here, for example, is the code to select the six foods in our data set that have the highest amount of Vitamin A:

food %>%
  arrange(desc(vitamin_a)) %>%
  slice(1:6)
## # A tibble: 6 × 17
##   item      food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##   <chr>     <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Carrot    vegeta…      41     0.2   0.037       0     69  9.58   2.8  4.54
## 2 Sweet Po… vegeta…      86     0     0.018       0     55 20.1    3    4.18
## 3 Cantalou… fruit        34     0.1   0.051       0     16  8.16   0.9  7.86
## 4 Bell Pep… vegeta…      26     0     0.059       0      2  6.03   2    4.2 
## 5 Grapefru… fruit        32     0     0.014       0      0  8.08   1.1  6.98
## 6 Tomato    vegeta…      18     0     0.046       0      5  3.92   1.2  2.63
## # … with 7 more variables: protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>, and
## #   abbreviated variable names ¹​food_group, ²​calories, ³​total_fat,
## #   ⁴​cholesterol

By saving this data set, we could highlight these specific foods on top of a plot comparing them to the remainder of the data.

Homework Questions

Consider the following subset of the hans dataset we worked with in the previous notebook:

## # A tibble: 7 × 4
##   country       continent life_exp    gdp
##   <chr>         <chr>        <dbl>  <dbl>
## 1 Canada        Americas      80.7 36319.
## 2 France        Europe        80.7 30470.
## 3 Ireland       Europe        78.9 40676.
## 4 Netherlands   Europe        79.8 36798.
## 5 Portugal      Europe        78.1 20510.
## 6 Spain         Europe        80.9 28821.
## 7 United States Americas      78.2 42952.

For each of the following five questions, write the R code that would produce the desired data set:

  1. Filter to include only those countries that have a life expectancy above 80 years.
  2. Order the data set from the highest life expectancy to the lowest life expectancy.
  3. Select only those rows that are located in Europe.
  4. Create a data set that only has all seven rows of the original data but includes only the features country and gdp.
  5. Create a data set that takes only the first two rows of the starting data.