Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Creating New Variables

Mutate verb

The final core dplyr verb that we will look at is used to create a new variable in our data set based on other variables that are already present. This verb is called mutate, and works by giving it the name of the variable you want to create followed by the code that describes how to construct the variable in terms of the rest of the data.

As an example, consider computing the number of calories in an 200g portion of each food. All of the variables in the data set are currently given as 100g portions, so to compute this we need to multiply the calories variables by 2. To do this, we use the mutate verb to name and describe a new variable calories_200g.

food %>%
  mutate(calories_200g = calories * 2)
## # A tibble: 61 x 18
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
##  2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  3 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
##  4 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
##  5 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
##  6 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  7 Beef  meat            288      19.5   7.73           87    384  0      0  
##  8 Bell… vegetable        26       0     0.059           0      2  6.03   2  
##  9 Crab  fish             87       1     0.222          78    293  0.04   0  
## 10 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
## # … with 51 more rows, and 9 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>, calories_200g <dbl>

Notice that there is a new variable named calories_200g that has been added as the last column in the data set. Because it is added at the end of the data set, it gets hidden in the output shown above. Making use of select allows us to see the new values:

food %>%
  mutate(calories_200g = calories * 2) %>%
  select(item, food_group, calories, calories_200g)
## # A tibble: 61 x 4
##    item        food_group calories calories_200g
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52           104
##  2 Asparagus   vegetable        20            40
##  3 Avocado     fruit           160           320
##  4 Banana      fruit            89           178
##  5 Chickpea    grains          180           360
##  6 String Bean vegetable        31            62
##  7 Beef        meat            288           576
##  8 Bell Pepper vegetable        26            52
##  9 Crab        fish             87           174
## 10 Broccoli    vegetable        34            68
## # … with 51 more rows

And now we can see that the new column has been created by doubling the number given the calories column.

Note that mutate can also be used to modify any existing column in the data set by using the name of an extant variable. In this case the position of the variable within the tables does not change.

The mutate verb itself has a relatively straightforward syntax. The main challenge is knowing how to apply and chain together the various transformations that are useful within an analysis. In the next few sections, we highlight several common types of operations that we will be useful in subsequent applications.

Conditional values

Many of the uses for the mutate verb involve assigning one value when a set of conditions is true and another if the conditions are false. For example, consider creating a new variable called sugar_level based on the relative amount of sugar in each food item. We might classify a food has having a “high” sugar level if has more than 10g of sugar per 100g serving, and a “normal” amount otherwise. In order to create this variable, we need the function if_else.

The if_else function has three parts: a TRUE/FALSE statement, the value to use when the statement is true, and the value to use when it is false. Here is an example to create our new variable:

food %>%
  mutate(sugar_level = if_else(sugar > 10, "high", "normal")) %>%
  select(item, food_group, sugar, sugar_level)
## # A tibble: 61 x 4
##    item        food_group sugar sugar_level
##    <chr>       <chr>      <dbl> <chr>      
##  1 Apple       fruit      10.4  high       
##  2 Asparagus   vegetable   1.88 normal     
##  3 Avocado     fruit       0.66 normal     
##  4 Banana      fruit      12.2  high       
##  5 Chickpea    grains      5.29 normal     
##  6 String Bean vegetable   1.4  normal     
##  7 Beef        meat        0    normal     
##  8 Bell Pepper vegetable   4.2  normal     
##  9 Crab        fish        0    normal     
## 10 Broccoli    vegetable   1.7  normal     
## # … with 51 more rows

Looking at the first rows of data, we see that apples and bananas are classified as high sugar foods, whereas the other sugar levels are given the sugar level category of “normal”.

The if_else function can be used to produce any number of categories by using it multiple times. Let’s modify our sugar level variable to now have three categories: “high” (over 10g), “low” (less than 1g), and “normal” (between 1g and 10g). There are several different ways to get to the same result, but I find the easiest is to start by assigning a default value and then changing the value of the new variable in sequence. For example, here some code that produces our new categories:

food %>%
  mutate(sugar_level = "default") %>%
  mutate(sugar_level = if_else(sugar < 1, "low", sugar_level)) %>%
  mutate(sugar_level = if_else(sugar > 10, "high", sugar_level)) %>%
  mutate(sugar_level = if_else(between(sugar, 1, 10), "normal", sugar_level)) %>%
  select(item, food_group, sugar, sugar_level)
## # A tibble: 61 x 4
##    item        food_group sugar sugar_level
##    <chr>       <chr>      <dbl> <chr>      
##  1 Apple       fruit      10.4  high       
##  2 Asparagus   vegetable   1.88 normal     
##  3 Avocado     fruit       0.66 low        
##  4 Banana      fruit      12.2  high       
##  5 Chickpea    grains      5.29 normal     
##  6 String Bean vegetable   1.4  normal     
##  7 Beef        meat        0    low        
##  8 Bell Pepper vegetable   4.2  normal     
##  9 Crab        fish        0    low        
## 10 Broccoli    vegetable   1.7  normal     
## # … with 51 more rows

In each if_else step we are telling the mutate function that if the condition is false set sugar_level equal to itself. In other words, if the condition does not hold, do not change the value of the variable.

In may wonder why we created a “default” value for the variable sugar_level. It would have been one less line of code to set the default value to “normal” and remove the final mutate function. The reason for the approach above is three-fold. First, it’s easier to understand what the code is doing in it’s current format because each condition (“high”, “normal”, and “low”) is explicitly coded. Secondly, it creates a nice check on our code and data. If we find a row of the output that still has the value “default” we will know that there is a problem somewhere. Finally, the code above will more safely handle the issues with missing values, and issue that we will return to shortly.

Creating labels

Another common type of manipulation that is used with the mutate verb is the creation of a textual label. This can be done with the function stri_paste. It takes any number of inputs and combines them into a single string. We can mix both variables in the data set as well as fixed strings to create useful labels for data visualizations. For example, if we create a label variable describing each fruit and its sugar content:

food %>%
  mutate(label = stri_paste(item, " (", sugar, "g)")) %>%
  select(item, sugar, label)
## # A tibble: 61 x 3
##    item        sugar label             
##    <chr>       <dbl> <chr>             
##  1 Apple       10.4  Apple (10.39g)    
##  2 Asparagus    1.88 Asparagus (1.88g) 
##  3 Avocado      0.66 Avocado (0.66g)   
##  4 Banana      12.2  Banana (12.23g)   
##  5 Chickpea     5.29 Chickpea (5.29g)  
##  6 String Bean  1.4  String Bean (1.4g)
##  7 Beef         0    Beef (0g)         
##  8 Bell Pepper  4.2  Bell Pepper (4.2g)
##  9 Crab         0    Crab (0g)         
## 10 Broccoli     1.7  Broccoli (1.7g)   
## # … with 51 more rows

The label variable could then be used as a textual label in a subsequent plot with textual data. This is similar to the sm_paste function that we saw when summarizing the values of a character vector.

Factors

R has a special data type called a “factor” (abbreviated “fct”) that is specifically designed to handle categorical variables. It is typically not a good idea to store data as a factor because the resulting variables have some odd, error-producing, behaviors. However, it can be useful to create a factor as part of a mutate function just prior to creating a data visualizations.

For us, biggest difference between factors and character vectors is that a factor vector has a default ordered of its unique values, called the factor’s “levels”. Creating and understanding factors is useful because it allows us to change the ordering of categories within visualizations and models (which by default is done alphabetically).

One of the easiest ways to produce a factor variable with a given order is through the function fct_inorder. It will order the categories in the same order that they (first) appear in the data set. Combining this with the arrange function provides a lot of control over how categories become ordered. For example, the following code produces a bar plot of the food groups in our data set arranged from the largest category to the smallest category:

food %>%
  group_by(food_group) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  mutate(food_group = fct_inorder(food_group)) %>%
  ggplot() +
    geom_col(aes(food_group, count))

Other useful functions for manipulating categories include fct_relevel for manually putting one category first and fct_lump_n for combining together the smallest categories into a collective “Other” category.

Mutate summaries

All of summary functions that were introduced in the previous notebook can also be applied within the mutate version. Instead of reducing the data to a single summary row, summarizing within the mutate verb duplicates the summary statistic in each row of the data set. Here is an example of including the average number of calories across all rows of the data set:

food %>%
  mutate(sm_mean(calories))
## # A tibble: 61 x 18
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
##  2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  3 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
##  4 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
##  5 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
##  6 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  7 Beef  meat            288      19.5   7.73           87    384  0      0  
##  8 Bell… vegetable        26       0     0.059           0      2  6.03   2  
##  9 Crab  fish             87       1     0.222          78    293  0.04   0  
## 10 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
## # … with 51 more rows, and 9 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>, calories_mean <dbl>

As with any call to mutate, all of the original variables are kept in the output and the new variable is added at the end. Using select we can verify that the average calories has in fact been added to each row of the table.

food %>%
  mutate(sm_mean(calories)) %>%
  select(item, food_group, calories, calories_mean)
## # A tibble: 61 x 4
##    item        food_group calories calories_mean
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52          114.
##  2 Asparagus   vegetable        20          114.
##  3 Avocado     fruit           160          114.
##  4 Banana      fruit            89          114.
##  5 Chickpea    grains          180          114.
##  6 String Bean vegetable        31          114.
##  7 Beef        meat            288          114.
##  8 Bell Pepper vegetable        26          114.
##  9 Crab        fish             87          114.
## 10 Broccoli    vegetable        34          114.
## # … with 51 more rows

The power of mutate summaries becomes particularly clear when grouping the data. If we group the data set by one or more variables and apply a summary function within a mutation, the repeated summaries will be done within each group. Here is an example of adding the average calories of each food group to the data set:

food %>%
  group_by(food_group) %>%
  mutate(sm_mean(calories)) %>%
  select(item, food_group, calories, calories_mean)
## # A tibble: 61 x 4
## # Groups:   food_group [6]
##    item        food_group calories calories_mean
##    <chr>       <chr>         <dbl>         <dbl>
##  1 Apple       fruit            52          54.9
##  2 Asparagus   vegetable        20          37.4
##  3 Avocado     fruit           160          54.9
##  4 Banana      fruit            89          54.9
##  5 Chickpea    grains          180         196. 
##  6 String Bean vegetable        31          37.4
##  7 Beef        meat            288         234. 
##  8 Bell Pepper vegetable        26          37.4
##  9 Crab        fish             87         167. 
## 10 Broccoli    vegetable        34          37.4
## # … with 51 more rows

Following this with a filter, for example, would allow us to select all of the foods that have a less than average number of calories within their food group. We will see many examples of grouped mutate summaries throughout our applications.

Labels and themes

We have seen a number of ways to create and modify data visualizations. One thing that we did not cover was how to label our axes. While many data visualization guides like to stress the importance of labelling axes, while in the exploratory phase of analysis it is often best to simply use the default labels provided by R. These are useful for a number of reasons. First, they require minimal effort and make it easy to tweak axes, variables, and other settings without spending time tweaking with the labels. Secondly, the default labels use the variable names in our dataset. When writing code this is exactly what we need to know about a variable to use it in additional plots, models, and data manipulation tasks. Of course, once we want to present our results to others, it is essential to provide more detailed descriptions of the axes and legends in our plot. Fortunately, this is relatively easy using the grammar of graphics.

In order to change the labels in a plot, we can use the labs function as an extra part of our plot. Inside the function, we assign labels to the names of aes values that you want to describe. Leaving a value unspecified will keep the default value in place. Labels for the x-axis and y-axis will be go on the sides of the plot. Labels for other aesthetics such as size and color will be placed in the legend. Here is an example of a scatterplot with labels for the three aesthetics:

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = sat_fat, color = food_group)) +
    labs(
      x = "Calories per 100g",
      y = "Saturated Fat (grams per 100g)",
      color = "Food Group"
    )

Notice that the descriptions inside of the labs function is fairly long. The code here breaks it up by putting each argument on its own line (indented a further two spaces). This is good practice when using functions with a lot of arguments.

We can also had a title (and optional subtitle and caption) to the plot by adding these as named arguments to the labs function.

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = sat_fat, color = food_group)) +
    labs(
      title = "Main title",
      subtitle = "A little more detail",
      caption = "Perhaps the source of the data?"
    )

Another way to prepare our graphics for publication is to modify the theme of a plot. Themes effect the way that the plot elements such as the axis labels, ticks, and background look. By default through this book, we have set the default plot to theme_minimal. I think that this is a great choice for exploration of a dataset. As the name implies, it removes most of the clutter of other choices while keeping grid lines and other visual cues to help interpret a dataset. When presenting information for external publication, I prefer to use the theme called theme_sm based on the work of Edward Tufte. To set the theme, just call the following line of code sometime before making your plot:

theme_set(theme_sm())

Now, when we construct a plot it will use the newly assigned theme:

food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = total_fat))