The final core dplyr verb that we will look at is
used to create a new feature in our data set based on other features
that are already present. This verb is called mutate, and
works by giving it the name of the feature you want to create followed
by the code that describes how to construct the feature in terms of the
rest of the data.
As an example, consider computing the number of calories in a 200g
portion of each food. All of the features in the data set are currently
given per 100g portion, so to compute this we need to multiply the
calories feature by 2. To do this, we use the mutate verb to name and
describe a new feature, calories_200g.
food %>%
  mutate(calories_200g = calories * 2)
## # A tibble: 61 × 18
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apple fruit 52 0.1 0.028 0 1 13.8 2.4 10.4
## 2 Asparag… vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 3 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## 4 Banana fruit 89 0.3 0.112 0 1 22.8 2.6 12.2
## 5 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 6 String … vegeta… 31 0.1 0.026 0 6 7.13 3.4 1.4
## 7 Beef meat 288 19.5 7.73 87 384 0 0 0
## 8 Bell Pe… vegeta… 26 0 0.059 0 2 6.03 2 4.2
## 9 Crab fish 87 1 0.222 78 293 0.04 0 0
## 10 Broccoli vegeta… 34 0.3 0.039 0 33 6.64 2.6 1.7
## # … with 51 more rows, 8 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, calories_200g <dbl>, and abbreviated variable names
## # ¹food_group, ²calories, ³total_fat, ⁴cholesterol
Notice that there is a new feature named calories_200g
that has been added as the last column in the data set. Because it is
added at the end of the data set, it gets hidden in the output shown
above. Making use of select allows us to see the new
values:
food %>%
  mutate(calories_200g = calories * 2) %>%
  select(item, food_group, calories, calories_200g)
## # A tibble: 61 × 4
## item food_group calories calories_200g
## <chr> <chr> <dbl> <dbl>
## 1 Apple fruit 52 104
## 2 Asparagus vegetable 20 40
## 3 Avocado fruit 160 320
## 4 Banana fruit 89 178
## 5 Chickpea grains 180 360
## 6 String Bean vegetable 31 62
## 7 Beef meat 288 576
## 8 Bell Pepper vegetable 26 52
## 9 Crab fish 87 174
## 10 Broccoli vegetable 34 68
## # … with 51 more rows
And now we can see that the new column has been created by doubling
the number given in the calories column.
Note that mutate can also be used to modify any existing
column in the data set by using the name of an extant feature. In this
case the position of the feature within the table does not change.
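As a minimal sketch of modifying an existing column, the code below reuses the name calories inside mutate. The table food_small is a hypothetical three-row stand-in built from values shown in the output above; it is not the full food data set.

```r
library(dplyr)

# Hypothetical three-row table; the values are copied from the output above.
food_small <- tibble(
  item = c("Apple", "Asparagus", "Avocado"),
  calories = c(52, 20, 160),
  sugar = c(10.4, 1.88, 0.66)
)

# Reusing the existing name `calories` overwrites that column in place
# (here rescaled to a 200g portion) rather than appending a new column.
food_doubled <- food_small %>%
  mutate(calories = calories * 2)

names(food_doubled)    # same columns, in the same order as food_small
food_doubled$calories  # the doubled values
```

The column order is unchanged, with calories still in the second position.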
The mutate verb itself has a relatively straightforward syntax. The main challenge is knowing how to apply and chain together the various transformations that are useful within an analysis. In the next section, we highlight several common types of operations that will be useful in subsequent applications.
Many of the uses for the mutate verb involve assigning one value when
a set of conditions is true and another if the conditions are false. For
example, consider creating a new feature called sugar_level
based on the relative amount of sugar in each food item. We might
classify a food as having a “high” sugar level if it has more than 10g of
sugar per 100g serving, and a “normal” amount otherwise. In order to
create this feature, we need the function if_else.
The if_else
function has three parts: a TRUE/FALSE
statement, the value to use when the statement is true, and the value to
use when it is false. Here is an example to create our new feature:
food %>%
  mutate(sugar_level = if_else(sugar > 10, "high", "normal")) %>%
  select(item, food_group, sugar, sugar_level)
## # A tibble: 61 × 4
## item food_group sugar sugar_level
## <chr> <chr> <dbl> <chr>
## 1 Apple fruit 10.4 high
## 2 Asparagus vegetable 1.88 normal
## 3 Avocado fruit 0.66 normal
## 4 Banana fruit 12.2 high
## 5 Chickpea grains 5.29 normal
## 6 String Bean vegetable 1.4 normal
## 7 Beef meat 0 normal
## 8 Bell Pepper vegetable 4.2 normal
## 9 Crab fish 0 normal
## 10 Broccoli vegetable 1.7 normal
## # … with 51 more rows
Looking at the first rows of data, we see that apples and bananas are classified as high sugar foods, whereas the other foods are given the sugar level category of “normal”.
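It can help to see the three parts of if_else outside of mutate. The sketch below applies it directly to a plain vector; the vector sugar is a hypothetical stand-in holding the first four sugar values from the table above.

```r
library(dplyr)

# Hypothetical vector holding the first four sugar values from the table.
sugar <- c(10.4, 1.88, 0.66, 12.2)

# Part 1: the TRUE/FALSE statement, evaluated once per element.
# Part 2: the value used where the statement is TRUE.
# Part 3: the value used where the statement is FALSE.
sugar_level <- if_else(sugar > 10, "high", "normal")

sugar_level
```

The result has one entry per input value, which is why the function fits naturally inside mutate: it produces exactly one value for each row.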
The if_else
function can be used to produce any number
of categories by using it multiple times. Let’s modify our sugar level
feature to now have three categories: “high” (over 10g), “low” (less
than 1g), and “normal” (between 1g and 10g). There are several different
ways to get to the same result, but I find the easiest is to start by
assigning a default value and then changing the value of the new feature
in sequence. For example, here is some code that produces our new
categories:
food %>%
  mutate(sugar_level = "default") %>%
  mutate(sugar_level = if_else(sugar < 1, "low", sugar_level)) %>%
  mutate(sugar_level = if_else(sugar > 10, "high", sugar_level)) %>%
  mutate(sugar_level = if_else(between(sugar, 1, 10), "normal", sugar_level)) %>%
  select(item, food_group, sugar, sugar_level)
## # A tibble: 61 × 4
## item food_group sugar sugar_level
## <chr> <chr> <dbl> <chr>
## 1 Apple fruit 10.4 high
## 2 Asparagus vegetable 1.88 normal
## 3 Avocado fruit 0.66 low
## 4 Banana fruit 12.2 high
## 5 Chickpea grains 5.29 normal
## 6 String Bean vegetable 1.4 normal
## 7 Beef meat 0 low
## 8 Bell Pepper vegetable 4.2 normal
## 9 Crab fish 0 low
## 10 Broccoli vegetable 1.7 normal
## # … with 51 more rows
In each if_else
step we are telling the mutate function
that if the condition is false set sugar_level
equal to
itself. In other words, if the condition does not hold, do not change
the value of the feature.
You may wonder why we created a “default” value for the feature
sugar_level. It would have been one fewer line of code to
set the default value to “normal” and remove the final mutate function.
The reason for the approach above is three-fold. First, it’s easier to
understand what the code is doing in its current format because each
condition (“high”, “normal”, and “low”) is explicitly coded. Secondly,
it creates a nice check on our code and data. If we find a row of the
output that still has the value “default” we will know that there is a
problem somewhere. Finally, the code above will more safely handle the
issues with missing values, an issue that we will return to
shortly.
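To illustrate the second point, here is a small sketch of how a leftover “default” value exposes a mistake. The table food_small is a hypothetical five-row stand-in using values from the output above, and the error in the last condition (a lower bound of 2 rather than 1) is deliberate.

```r
library(dplyr)

# Hypothetical five-row table; the values are copied from the output above.
food_small <- tibble(
  item = c("Apple", "Asparagus", "Avocado", "Banana", "String Bean"),
  sugar = c(10.4, 1.88, 0.66, 12.2, 1.4)
)

# Deliberate mistake for illustration: the "normal" range starts at 2
# instead of 1, so sugar values between 1 and 2 match no condition.
food_gap <- food_small %>%
  mutate(sugar_level = "default") %>%
  mutate(sugar_level = if_else(sugar < 1, "low", sugar_level)) %>%
  mutate(sugar_level = if_else(sugar > 10, "high", sugar_level)) %>%
  mutate(sugar_level = if_else(between(sugar, 2, 10), "normal", sugar_level))

# Any rows returned here were missed by every condition.
unmatched <- food_gap %>%
  filter(sugar_level == "default")

unmatched
```

Asparagus and String Bean fall in the gap between 1g and 2g, so they keep the value “default” and show up in the check. Had the default been “normal”, the same mistake would have passed unnoticed.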
All of the summary functions that were introduced in the previous notebook can also be applied within the mutate verb. Instead of reducing the data to a single summary row, summarizing within the mutate verb duplicates the summary statistic in each row of the data set. Here is an example of including the average number of calories across all rows of the data set:
food %>%
  mutate(calories_mean = mean(calories))
## # A tibble: 61 × 18
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apple fruit 52 0.1 0.028 0 1 13.8 2.4 10.4
## 2 Asparag… vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 3 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## 4 Banana fruit 89 0.3 0.112 0 1 22.8 2.6 12.2
## 5 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 6 String … vegeta… 31 0.1 0.026 0 6 7.13 3.4 1.4
## 7 Beef meat 288 19.5 7.73 87 384 0 0 0
## 8 Bell Pe… vegeta… 26 0 0.059 0 2 6.03 2 4.2
## 9 Crab fish 87 1 0.222 78 293 0.04 0 0
## 10 Broccoli vegeta… 34 0.3 0.039 0 33 6.64 2.6 1.7
## # … with 51 more rows, 8 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, calories_mean <dbl>, and abbreviated variable names
## # ¹food_group, ²calories, ³total_fat, ⁴cholesterol
As with any call to mutate, all of the original features are kept in
the output and the new feature is added at the end. Using
select
we can verify that the average calories has in fact
been added to each row of the table.
food %>%
  mutate(calories_mean = mean(calories)) %>%
  select(item, food_group, calories, calories_mean)
## # A tibble: 61 × 4
## item food_group calories calories_mean
## <chr> <chr> <dbl> <dbl>
## 1 Apple fruit 52 114.
## 2 Asparagus vegetable 20 114.
## 3 Avocado fruit 160 114.
## 4 Banana fruit 89 114.
## 5 Chickpea grains 180 114.
## 6 String Bean vegetable 31 114.
## 7 Beef meat 288 114.
## 8 Bell Pepper vegetable 26 114.
## 9 Crab fish 87 114.
## 10 Broccoli vegetable 34 114.
## # … with 51 more rows
The power of mutate summaries becomes particularly clear when grouping the data. If we group the data set by one or more features and apply a summary function within a mutation, the repeated summaries will be done within each group. Here is an example of adding the average calories of each food group to the data set:
food %>%
  group_by(food_group) %>%
  mutate(calories_mean = mean(calories)) %>%
  select(item, food_group, calories, calories_mean)
## # A tibble: 61 × 4
## # Groups: food_group [6]
## item food_group calories calories_mean
## <chr> <chr> <dbl> <dbl>
## 1 Apple fruit 52 54.9
## 2 Asparagus vegetable 20 37.4
## 3 Avocado fruit 160 54.9
## 4 Banana fruit 89 54.9
## 5 Chickpea grains 180 196.
## 6 String Bean vegetable 31 37.4
## 7 Beef meat 288 234.
## 8 Bell Pepper vegetable 26 37.4
## 9 Crab fish 87 167.
## 10 Broccoli vegetable 34 37.4
## # … with 51 more rows
Following this with a filter, for example, would allow us to select all of the foods that have a less than average number of calories within their food group. We will see many examples of grouped mutate summaries throughout our applications.
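The filter step described above might look something like the following sketch. The table food_small is a hypothetical seven-row stand-in built from the fruit and vegetable rows shown in the output, so its group averages differ from those of the full data set.

```r
library(dplyr)

# Hypothetical subset of the food data; values are copied from the output above.
food_small <- tibble(
  item = c("Apple", "Asparagus", "Avocado", "Banana",
           "String Bean", "Bell Pepper", "Broccoli"),
  food_group = c("fruit", "vegetable", "fruit", "fruit",
                 "vegetable", "vegetable", "vegetable"),
  calories = c(52, 20, 160, 89, 31, 26, 34)
)

# Add each group's average to every row, then keep only the rows
# whose calories fall below the average of their own group.
below_avg <- food_small %>%
  group_by(food_group) %>%
  mutate(calories_mean = mean(calories)) %>%
  filter(calories < calories_mean) %>%
  ungroup()

below_avg
```

In this subset, apples and bananas fall below the fruit average, and asparagus and bell peppers fall below the vegetable average. The call to ungroup at the end is an optional habit that keeps later steps from being silently applied within each group.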
Your only homework for these notes is to look over the topics on the first exam and come to class with any questions or topics that you would like me to review before the exam.