06. Creating Features

Mutate verb

The final core dplyr verb that we will look at is used to create a new feature in our data set based on other features that are already present. This verb is called mutate. We give it the name of the feature we want to create, followed by the code that describes how to construct the feature in terms of the rest of the data.

As an example, consider computing the number of calories in a 200g portion of each food. All of the features in the data set are currently given per 100g portion, so to compute this we need to multiply the calories feature by 2. To do this, we use the mutate verb to name and describe a new feature calories_200g.

food %>%
  mutate(calories_200g = calories * 2)

## # A tibble: 61 × 18
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Apple    fruit        52     0.1   0.028       0      1 13.8    2.4 10.4 
##  2 Asparag… vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
##  3 Avocado  fruit       160    14.6   2.13        0      7  8.53   6.7  0.66
##  4 Banana   fruit        89     0.3   0.112       0      1 22.8    2.6 12.2 
##  5 Chickpea grains      180     2.9   0.309       0    243 30.0    8.6  5.29
##  6 String … vegeta…      31     0.1   0.026       0      6  7.13   3.4  1.4 
##  7 Beef     meat        288    19.5   7.73       87    384  0      0    0   
##  8 Bell Pe… vegeta…      26     0     0.059       0      2  6.03   2    4.2 
##  9 Crab     fish         87     1     0.222      78    293  0.04   0    0   
## 10 Broccoli vegeta…      34     0.3   0.039       0     33  6.64   2.6  1.7 
## # … with 51 more rows, 8 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, calories_200g <dbl>, and abbreviated variable names
## #   ¹food_group, ²calories, ³total_fat, ⁴cholesterol

Notice that there is a new feature named calories_200g that has been added as the last column in the data set. Because it is added at the end of the data set, its values are hidden in the output shown above; only its name appears in the list of additional variables. Making use of select allows us to see the new values:

food %>%
  mutate(calories_200g = calories * 2) %>%
  select(item, food_group, calories, calories_200g)

## # A tibble: 61 × 4
##    item        food_group calories calories_200g
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52           104
##  2 Asparagus   vegetable        20            40
##  3 Avocado     fruit           160           320
##  4 Banana      fruit            89           178
##  5 Chickpea    grains          180           360
##  6 String Bean vegetable        31            62
##  7 Beef        meat            288           576
##  8 Bell Pepper vegetable        26            52
##  9 Crab        fish             87           174
## 10 Broccoli    vegetable        34            68
## # … with 51 more rows

And now we can see that the new column has been created by doubling the number given in the calories column.

Note that mutate can also be used to modify any existing column in the data set by using the name of an extant feature. In this case the position of the feature within the table does not change.
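
As a minimal sketch of overwriting a column, the code below uses a small made-up table (the names toy and toy_200g are illustrative assumptions and not part of the food data set) and replaces its calories column with the value for a 200g portion. Because the name calories already exists, no new column is added and the column stays in its original position.

```r
library(dplyr)

# A small hypothetical table standing in for the food data set
toy <- tibble(
  item = c("Apple", "Banana"),
  calories = c(52, 89),
  sugar = c(10.4, 12.2)
)

# Re-using the existing name `calories` overwrites that column in place
toy_200g <- toy %>%
  mutate(calories = calories * 2)

toy_200g
```

In practice it is often safer to create a new feature with a descriptive name, as in calories_200g above, so that the original values remain available.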

The mutate verb itself has a relatively straightforward syntax. The main challenge is knowing how to apply and chain together the various transformations that are useful within an analysis. In the next section, we highlight several common types of operations that will be useful in subsequent applications.

Conditional values

Many of the uses for the mutate verb involve assigning one value when a set of conditions is true and another if the conditions are false. For example, consider creating a new feature called sugar_level based on the relative amount of sugar in each food item. We might classify a food as having a “high” sugar level if it has more than 10g of sugar per 100g serving, and a “normal” amount otherwise. In order to create this feature, we need the function if_else.

The if_else function has three parts: a TRUE/FALSE statement, the value to use when the statement is true, and the value to use when it is false. Here is an example to create our new feature:

food %>%
  mutate(sugar_level = if_else(sugar > 10, "high", "normal")) %>%
  select(item, food_group, sugar, sugar_level)

## # A tibble: 61 × 4
##    item        food_group sugar sugar_level
##    <chr>       <chr>      <dbl> <chr>      
##  1 Apple       fruit      10.4  high       
##  2 Asparagus   vegetable   1.88 normal     
##  3 Avocado     fruit       0.66 normal     
##  4 Banana      fruit      12.2  high       
##  5 Chickpea    grains      5.29 normal     
##  6 String Bean vegetable   1.4  normal     
##  7 Beef        meat        0    normal     
##  8 Bell Pepper vegetable   4.2  normal     
##  9 Crab        fish        0    normal     
## 10 Broccoli    vegetable   1.7  normal     
## # … with 51 more rows

Looking at the first rows of data, we see that apples and bananas are classified as high sugar foods, whereas the other foods are given the sugar level category of “normal”.

The if_else function can be used to produce any number of categories by using it multiple times. Let’s modify our sugar level feature to now have three categories: “high” (over 10g), “low” (less than 1g), and “normal” (between 1g and 10g). There are several different ways to get to the same result, but I find the easiest is to start by assigning a default value and then changing the value of the new feature in sequence. For example, here is some code that produces our new categories:

food %>%
  mutate(sugar_level = "default") %>%
  mutate(sugar_level = if_else(sugar < 1, "low", sugar_level)) %>%
  mutate(sugar_level = if_else(sugar > 10, "high", sugar_level)) %>%
  mutate(sugar_level = if_else(between(sugar, 1, 10), "normal", sugar_level)) %>%
  select(item, food_group, sugar, sugar_level)

## # A tibble: 61 × 4
##    item        food_group sugar sugar_level
##    <chr>       <chr>      <dbl> <chr>      
##  1 Apple       fruit      10.4  high       
##  2 Asparagus   vegetable   1.88 normal     
##  3 Avocado     fruit       0.66 low        
##  4 Banana      fruit      12.2  high       
##  5 Chickpea    grains      5.29 normal     
##  6 String Bean vegetable   1.4  normal     
##  7 Beef        meat        0    low        
##  8 Bell Pepper vegetable   4.2  normal     
##  9 Crab        fish        0    low        
## 10 Broccoli    vegetable   1.7  normal     
## # … with 51 more rows

In each if_else step, we are telling the mutate function that, if the condition is false, it should set sugar_level equal to itself. In other words, if the condition does not hold, do not change the value of the feature.

You may wonder why we created a “default” value for the feature sugar_level. It would have been one less line of code to set the default value to “normal” and remove the final mutate function. The reason for the approach above is three-fold. First, it’s easier to understand what the code is doing in its current format because each condition (“high”, “normal”, and “low”) is explicitly coded. Second, it creates a nice check on our code and data. If we find a row of the output that still has the value “default” we will know that there is a problem somewhere. Finally, the code above will more safely handle the issues with missing values, an issue that we will return to shortly.
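
The check described in the second point can be sketched as follows. The table toy and the names toy_levels and leftover are made up for illustration; the sugar values are taken from the rows shown above. After running the same sequence of steps, we filter for any rows that still carry the value “default”. An empty result suggests that every row was covered by one of the three conditions.

```r
library(dplyr)

# A small hypothetical table standing in for the food data set
toy <- tibble(
  item = c("Apple", "Avocado", "Chickpea"),
  sugar = c(10.4, 0.66, 5.29)
)

# The same sequence of steps as in the notes
toy_levels <- toy %>%
  mutate(sugar_level = "default") %>%
  mutate(sugar_level = if_else(sugar < 1, "low", sugar_level)) %>%
  mutate(sugar_level = if_else(sugar > 10, "high", sugar_level)) %>%
  mutate(sugar_level = if_else(between(sugar, 1, 10), "normal", sugar_level))

# Rows that no condition touched would still be labelled "default"
leftover <- toy_levels %>%
  filter(sugar_level == "default")

nrow(leftover)
```

If the number of leftover rows is not zero, the conditions do not cover all of the values in the data and should be revisited.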

Mutate summaries

All of the summary functions that were introduced in the previous notebook can also be applied within the mutate verb. Instead of reducing the data to a single summary row, summarizing within the mutate verb duplicates the summary statistic in each row of the data set. Here is an example of including the average number of calories across all rows of the data set:

food %>%
  mutate(calories_mean = mean(calories))

## # A tibble: 61 × 18
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Apple    fruit        52     0.1   0.028       0      1 13.8    2.4 10.4 
##  2 Asparag… vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
##  3 Avocado  fruit       160    14.6   2.13        0      7  8.53   6.7  0.66
##  4 Banana   fruit        89     0.3   0.112       0      1 22.8    2.6 12.2 
##  5 Chickpea grains      180     2.9   0.309       0    243 30.0    8.6  5.29
##  6 String … vegeta…      31     0.1   0.026       0      6  7.13   3.4  1.4 
##  7 Beef     meat        288    19.5   7.73       87    384  0      0    0   
##  8 Bell Pe… vegeta…      26     0     0.059       0      2  6.03   2    4.2 
##  9 Crab     fish         87     1     0.222      78    293  0.04   0    0   
## 10 Broccoli vegeta…      34     0.3   0.039       0     33  6.64   2.6  1.7 
## # … with 51 more rows, 8 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, calories_mean <dbl>, and abbreviated variable names
## #   ¹food_group, ²calories, ³total_fat, ⁴cholesterol

As with any call to mutate, all of the original features are kept in the output and the new feature is added at the end. Using select we can verify that the average number of calories has in fact been added to each row of the table.

food %>%
  mutate(calories_mean = mean(calories)) %>%
  select(item, food_group, calories, calories_mean)

## # A tibble: 61 × 4
##    item        food_group calories calories_mean
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52          114.
##  2 Asparagus   vegetable        20          114.
##  3 Avocado     fruit           160          114.
##  4 Banana      fruit            89          114.
##  5 Chickpea    grains          180          114.
##  6 String Bean vegetable        31          114.
##  7 Beef        meat            288          114.
##  8 Bell Pepper vegetable        26          114.
##  9 Crab        fish             87          114.
## 10 Broccoli    vegetable        34          114.
## # … with 51 more rows

The power of mutate summaries becomes particularly clear when grouping the data. If we group the data set by one or more features and apply a summary function within a mutation, the repeated summaries will be done within each group. Here is an example of adding the average calories of each food group to the data set:

food %>%
  group_by(food_group) %>%
  mutate(calories_mean = mean(calories)) %>%
  select(item, food_group, calories, calories_mean)

## # A tibble: 61 × 4
## # Groups:   food_group [6]
##    item        food_group calories calories_mean
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52          54.9
##  2 Asparagus   vegetable        20          37.4
##  3 Avocado     fruit           160          54.9
##  4 Banana      fruit            89          54.9
##  5 Chickpea    grains          180         196. 
##  6 String Bean vegetable        31          37.4
##  7 Beef        meat            288         234. 
##  8 Bell Pepper vegetable        26          37.4
##  9 Crab        fish             87         167. 
## 10 Broccoli    vegetable        34          37.4
## # … with 51 more rows

Following this with a filter, for example, would allow us to select all of the foods that have a less than average number of calories within their food group. We will see many examples of grouped mutate summaries throughout our applications.
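
A minimal sketch of the filter step described above is given here. It uses a small made-up table (toy and below_average are illustrative names, and the group means are computed only from these five rows, so they differ from the means in the full food data set). The call to ungroup at the end is an addition that removes the grouping once it is no longer needed.

```r
library(dplyr)

# A small hypothetical table standing in for the food data set
toy <- tibble(
  item = c("Apple", "Avocado", "Banana", "Asparagus", "Broccoli"),
  food_group = c("fruit", "fruit", "fruit", "vegetable", "vegetable"),
  calories = c(52, 160, 89, 20, 34)
)

# Add the group mean to every row, then keep only rows below that mean
below_average <- toy %>%
  group_by(food_group) %>%
  mutate(calories_mean = mean(calories)) %>%
  filter(calories < calories_mean) %>%
  ungroup()

below_average
```

Each food is compared against the mean of its own food group rather than the overall mean, which is what distinguishes this from the ungrouped version.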

Homework Questions

Your only homework for these notes is to look over the topics on the first exam and come to class with any questions or topics that you would like me to review before the exam.