Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Organizing Data

Verbs

In these notes we are going to cover a set of functions that take a data frame as an input and return a new version of the data frame. These functions are called verbs and come from the dplyr package. If you are familiar with running database queries, note that all of these verbs map onto SQL commands. In fact, R can be set up so that dplyr is called over a database rather than a local data frame in memory.

There are over 40 verbs in the dplyr package, though most are a minor variant or specific application of another verb. In this notebook we will see only four of them, all of which are related to selecting and arranging rows and columns:

  • select a subset of rows from the original data set (filter and slice)
  • select a subset of columns from the original data set (select)
  • sort the rows of a data set (arrange)

In all verb functions, the first argument is the original data frame and the output is a new data frame. Here, we will also see the functions between and %in% to assist with the filtering command and desc to assist with arranging the rows of a data set.

Note that verbs do not modify the original data; they operate on a copy of the original data. We have to make an explicit name for the new data set if we want to save it for use elsewhere.

Choosing rows

It is often useful to take a subset of the rows of an existing data set, for example if you want to build a model on a certain subpopulation or highlight a particular part of the data in a plot. Perhaps the most straightforward way to take a subset of rows is to indicate the specific row numbers that we want to extract. In order to select rows by row numbers, we use the verb slice, followed by the numbers of the rows we want separated by commas. Here is an example taking the second, fifth, and seventh rows of the data:

food %>%
  slice(2, 5, 7)
## # A tibble: 3 x 17
##   item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##   <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
## 1 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
## 2 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
## 3 Beef  meat            288      19.5   7.73           87    384  0      0  
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>

As mentioned above, the code here does not change the data set food itself. It still has all 61 rows of food contained in it. If we want to create a new data set with just these three food items, we need to explicitly name and assign it. For example, here is how we would create a data set of the first five food items named food_first_five:

food_first_five <- food %>%
  slice(1, 2, 3, 4, 5)
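
The following is a minimal sketch of this behavior, not part of the original notebook. It assumes a small hypothetical data frame named food_small, built from a few values that appear in the output above, so that it can run without the full food data. The name food_top is also made up for illustration.

```r
library(dplyr)

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Beef", "Oat", "Penne"),
  food_group = c("fruit", "meat", "grains", "grains"),
  calories = c(52, 288, 389, 157)
)

# The verb returns a new data frame; nothing is saved here
food_small %>%
  slice(1, 2)

# food_small itself is unchanged and still has four rows
nrow(food_small)

# Assigning the result to a name keeps it for later use
food_top <- food_small %>%
  slice(1, 2)
nrow(food_top)
```

Checking nrow before and after is a quick way to confirm that a verb did not alter the input.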

There is a convenient shorthand for selecting a range of row numbers, for example every row from the tenth to the twentieth: give the starting and ending row numbers separated by a colon. Here, for example, is another way to select the first five rows of the data set:

food %>%
  slice(1:5)
## # A tibble: 5 x 17
##   item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##   <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
## 1 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
## 2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
## 3 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
## 4 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
## 5 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>
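
As a small sketch of why the two forms are interchangeable: in R the colon builds the vector of consecutive integers, and slice accepts that vector in place of the individual row numbers. The data frame food_small and the names by_listing and by_range are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Beef", "Clam", "Cod", "Lamb", "Oat"),
  calories = c(52, 288, 180, 211, 292, 389)
)

# The colon expands to consecutive row numbers
2:4

# Listing the rows one by one ...
by_listing <- food_small %>%
  slice(2, 3, 4)

# ... selects the same rows as the range
by_range <- food_small %>%
  slice(2:4)

identical(by_listing$item, by_range$item)
```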

Another way to take a subset of our data is to select rows based on conditions about the variables in the data set. To do this we use the filter function, which accepts a statement about the variables in the data set. Only rows where the statement is true will be returned. For example, here is how we use the filter command to select the foods that have more than 150 calories in each serving:

food %>%
  filter(calories > 150)
## # A tibble: 20 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
##  2 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
##  3 Beef  meat            288      19.5   7.73           87    384  0      0  
##  4 Catf… fish            240      14.5   3.25           69    398  8.54   0.5
##  5 Chee… dairy           350      26.9  16.6            83    955  4.71   0  
##  6 Chic… meat            237      13.4   3.76           87    404  0      0  
##  7 Clam  fish            180       8     1.60           56    400 11.1    0.5
##  8 Cod   fish            211      10.8   2.22           57    401  8.25   0.5
##  9 Hali… fish            239      17.7   3.10           59    103  0      0  
## 10 Lamb  meat            292      20.7   8.76           96    394  0      0  
## 11 Oat   grains          389       6     1.22            0      2 66.3   10.6
## 12 Oyst… fish            160       7.9   1.85           57    595 12.5    0.5
## 13 Penne grains          157       0.9   0.175           0    233 30.7    1.8
## 14 Pork  meat            271      17     6.17           90    384  0      0  
## 15 Salm… fish            171       7.5   1.31           62    467  0.49   0  
## 16 Scal… fish            217      10.9   2.22           54    487 10.5    0.5
## 17 Sour… dairy           214      20.9  13.0            44     53  4.27   0  
## 18 Swor… fish            177       8.2   1.96           47    494  0.49   0  
## 19 Tuna  fish            153       3.9   0.811          53    366  0.41   0  
## 20 Turk… meat            187       7     2.00           77     69  0      0  
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>

The output data set has only 20 rows, compared to the 61 in the original data. Other comparisons can be done with <, >= and <=. There is also a special function called between that is often useful. For example, here are the rows that have between 2 and 3 grams of total fat:

food %>%
  filter(between(total_fat, 2, 3))
## # A tibble: 4 x 17
##   item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##   <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
## 1 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
## 2 Quin… grains          143       2.2   0.226           0    196 26.4    2.3
## 3 Shri… fish            144       2.3   0.446         206    613  1.24   0  
## 4 Pota… vegetable       104       2     0.458           0    254 19.4    1.7
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>
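
Note that between includes both endpoints, which is why a row with exactly 2 grams of total fat appears in the output above. The sketch below illustrates this on a small hypothetical data frame (food_small, with a few values taken from the tables in these notes) and shows the equivalent filter written with >= and <=. The names with_between and with_compare are made up for this example.

```r
library(dplyr)

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Crab", "Oat", "Duck", "Beef"),
  total_fat = c(0.1, 1, 6, 5.9, 19.5)
)

# between(x, left, right) is TRUE when left <= x <= right,
# so rows equal to 1 or 6 are kept
with_between <- food_small %>%
  filter(between(total_fat, 1, 6))

# The same condition written with explicit comparisons
with_compare <- food_small %>%
  filter(total_fat >= 1 & total_fat <= 6)

with_between
identical(with_between$item, with_compare$item)
```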

If you want to filter on a categorical variable, you can use the %in% operator to select specific categories. Here is the code to keep only the rows in the fish and vegetable food groups:

food %>%
  filter(food_group %in% c("fish", "vegetable"))
## # A tibble: 30 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  2 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  3 Bell… vegetable        26       0     0.059           0      2  6.03   2  
##  4 Crab  fish             87       1     0.222          78    293  0.04   0  
##  5 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
##  6 Cabb… vegetable        24       0.1   0.016           0     18  5.58   2.3
##  7 Carr… vegetable        41       0.2   0.037           0     69  9.58   2.8
##  8 Catf… fish            240      14.5   3.25           69    398  8.54   0.5
##  9 Caul… vegetable        25       0     0.032           0     30  5.3    2.5
## 10 Cele… vegetable        14       0.1   0.043           0     80  2.97   1.6
## # … with 20 more rows, and 8 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>
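
To see what %in% does on its own, here is a brief sketch outside of filter. The operator returns one TRUE or FALSE per element on its left-hand side, and filter keeps the rows where the value is TRUE. The vector groups and the data frame food_small are hypothetical examples, not part of the food data.

```r
library(dplyr)

# %in% tests each element on the left for membership in the set on the right
groups <- c("fruit", "fish", "meat", "vegetable")
groups %in% c("fish", "vegetable")
# FALSE  TRUE FALSE  TRUE

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Crab", "Beef", "Cod"),
  food_group = c("fruit", "fish", "meat", "fish")
)

# filter keeps exactly the rows where the test is TRUE
fish_only <- food_small %>%
  filter(food_group %in% c("fish", "vegetable"))
fish_only
```

A category that never occurs in the data (vegetable here) simply matches no rows; it does not cause an error.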

As with the other verbs, we can chain together multiple calls to produce more complex logic. For example, this code selects fruits that have more than 150 calories per serving:

food %>%
  filter(calories > 150) %>%
  filter(food_group %in% c("fruit"))
## # A tibble: 1 x 17
##   item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##   <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
## 1 Avoc… fruit           160      14.6    2.13           0      7  8.53   6.7
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>

This results in a reduced data set with only 1 row (avocados). You can also use == to test equality (food_group == "fruit") or != to test whether a variable is not equal to a specific value.
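
As a sketch of these operators, the code below uses a small hypothetical data frame (food_small) in place of the full food data. It also shows that filter accepts several conditions separated by commas, which keeps only the rows where all of them are true and gives the same result as chaining two filter calls. The names chained, one_call, and not_fish are made up for this example.

```r
library(dplyr)

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Avocado", "Beef", "Cod"),
  food_group = c("fruit", "fruit", "meat", "fish"),
  calories = c(52, 160, 288, 211)
)

# Two chained calls to filter ...
chained <- food_small %>%
  filter(calories > 150) %>%
  filter(food_group == "fruit")

# ... match a single call with comma-separated conditions
one_call <- food_small %>%
  filter(calories > 150, food_group == "fruit")

# != keeps the rows that do not match the value
not_fish <- food_small %>%
  filter(food_group != "fish")

one_call
not_fish
```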

It is also possible to create a chain of calls that then get piped into a call to the ggplot function. For example, here is a plot of the fruits and vegetables with the avocado outlier removed (by keeping only the items with less than 10 grams of total fat).

food %>%
  filter(food_group %in% c("vegetable", "fruit")) %>%
  filter(total_fat < 10) %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat, color = food_group)) +
    geom_text_repel(aes(x = calories, y = total_fat, label = item)) +
    scale_color_viridis_d()

Starting with a data set, applying a number of transformations, and then creating a visualization of the result will become a common pattern in our analyses.

Data and Layers

Now that we know how to create a subset of our data, let’s use this new knowledge to build some interesting data visualizations. To start, create a data set that just consists of the food types that are in the meat food group:

food_meat <- filter(food, food_group %in% c("meat"))
food_meat
## # A tibble: 6 x 17
##   item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##   <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
## 1 Beef  meat            288      19.5    7.73          87    384     0     0
## 2 Chic… meat            237      13.4    3.76          87    404     0     0
## 3 Duck  meat            132       5.9    2.32          77     74     0     0
## 4 Lamb  meat            292      20.7    8.76          96    394     0     0
## 5 Pork  meat            271      17      6.17          90    384     0     0
## 6 Turk… meat            187       7      2.00          77     69     0     0
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>

One of the core ideas behind the Grammar of Graphics is that complex visualizations can be constructed by layering relatively simple elements on top of one another. What if we wanted to put together two layers where one layer uses the food data set and the other uses food_meat? To do this, we can override the default data set in a layer with the option data =. This will use a different data set within a particular layer. For example, here is how we can layer the meat data set on top of the rest of the food items.

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat)) +
    geom_point(aes(x = calories, y = total_fat), data = food_meat)

This plot, however, does not look any different than it would if we were just to plot all of the food together. The second layer of points just sits unassumingly on top of the rest of the data. To rectify this, we can color each layer a different color in order to distinguish them from one another. Let’s try to highlight the meat food group in a navy blue, while making the rest of the points a light grey:

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat), color = "grey85") +
    geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat)

We now have a plot that shows exactly where the meats are relative to the other food items. We can further build up the plot by showing the names of just these rows of the data set as well:

food %>%
  ggplot() +
    geom_point(aes(x = calories, y = total_fat), color = "grey85") +
    geom_point(aes(x = calories, y = total_fat), color = "navy", data = food_meat) +
    geom_text_repel(
      aes(x = calories, y = total_fat, label = item),
      color = "navy",
      data = food_meat
    )
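
One way to confirm which rows each layer draws is the layer_data function from ggplot2, which returns the data used by a given layer of a plot. The sketch below is an illustration under stated assumptions rather than part of the original notebook: food_small is a hypothetical miniature of the food data, and the filtered data is passed directly to data = instead of being saved under its own name first.

```r
library(dplyr)
library(ggplot2)

# Hypothetical miniature of the food data, for illustration only
food_small <- tibble(
  item = c("Apple", "Beef", "Oat", "Penne", "Lamb"),
  food_group = c("fruit", "meat", "grains", "grains", "meat"),
  calories = c(52, 288, 389, 157, 292),
  total_fat = c(0.1, 19.5, 6, 0.9, 20.7)
)

p <- food_small %>%
  ggplot() +
    # First layer: inherits the data set piped into ggplot()
    geom_point(aes(x = calories, y = total_fat), color = "grey85") +
    # Second layer: data = overrides the default for this layer only
    geom_point(
      aes(x = calories, y = total_fat),
      color = "navy",
      data = filter(food_small, food_group == "meat")
    )

# Number of points drawn by each layer
nrow(layer_data(p, 1))
nrow(layer_data(p, 2))
```

Saving the subset under its own name, as done with food_meat above, is usually the better choice when several layers share it.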