Overview (read this section twice)

These notes all focus on the core concept of a primary key. A primary key is a feature or group of features (in other words, a column or group of columns) in a dataset that uniquely identifies a row of data. In our food dataset, the feature item is a primary key. In the us dataset, the city and year columns together form a primary key. We need both to uniquely identify one observation.

Likewise, a foreign key is the appearance of a primary key within a different dataset. The food group feature in the food dataset, for example, could be a foreign key if we had another table with one row describing information about each food group.

A primary key and the corresponding foreign key in another table form a relation. Typically a relation maps a single row in one dataset to many rows in another.

A table join is a way of combining two tables based on relations. The goal is to match up a foreign key in one table with the primary key in another table, to add new features from one dataset into another dataset. Several examples will make this notion more clear.

Joining data by key

As an example of performing table joins, we will start with a pared down version of our foods data:

food_sml <- food %>%
  select(item, food_group, calories)

food_sml
## # A tibble: 61 × 3
##    item        food_group calories
##    <chr>       <chr>         <dbl>
##  1 Apple       fruit            52
##  2 Asparagus   vegetable        20
##  3 Avocado     fruit           160
##  4 Banana      fruit            89
##  5 Chickpea    grains          180
##  6 String Bean vegetable        31
##  7 Beef        meat            288
##  8 Bell Pepper vegetable        26
##  9 Crab        fish             87
## 10 Broccoli    vegetable        34
## # … with 51 more rows

Now, consider the following new dataset corresponding to dietary restrictions associated with different food groups:

diet <- read_csv(file.path("..", "data", "food_diet_restrictions.csv"))
diet
## # A tibble: 6 × 4
##   food_group vegan vegetarian pescatarian
##   <chr>      <chr> <chr>      <chr>      
## 1 fruit      yes   yes        yes        
## 2 vegetable  yes   yes        yes        
## 3 grains     yes   yes        yes        
## 4 meat       no    no         no         
## 5 fish       no    no         yes        
## 6 dairy      no    yes        yes

In the diet table, the food group feature is a primary key. Within the food table, it is a foreign key. What we would like to do is to combine these datasets by matching up rows that have the same values in the corresponding columns.

For example, it could be helpful to include the columns vegan, vegetarian, and pescatarian in the foods dataset. In order to do this, we need to associate a row in the food dataset with the row in the diet dataset that contains the same value of the feature food_group. We do this using the function left_join; we pipe in the larger dataset that we want to join columns to, provide the name of the table to grab columns from, and indicate what column will be used to combine the two datasets.

food_sml %>%
  left_join(diet, by = c("food_group"))
## # A tibble: 61 × 6
##    item        food_group calories vegan vegetarian pescatarian
##    <chr>       <chr>         <dbl> <chr> <chr>      <chr>      
##  1 Apple       fruit            52 yes   yes        yes        
##  2 Asparagus   vegetable        20 yes   yes        yes        
##  3 Avocado     fruit           160 yes   yes        yes        
##  4 Banana      fruit            89 yes   yes        yes        
##  5 Chickpea    grains          180 yes   yes        yes        
##  6 String Bean vegetable        31 yes   yes        yes        
##  7 Beef        meat            288 no    no         no         
##  8 Bell Pepper vegetable        26 yes   yes        yes        
##  9 Crab        fish             87 no    no         yes        
## 10 Broccoli    vegetable        34 yes   yes        yes        
## # … with 51 more rows

And that is it! We now have the dietary restrictions added into the food dataset. We can now compute models, summaries, and construct visualizations based on the new metadata associated with each food group.

Now, let’s consider a third dataset containing information about food recipes, one for a Pot Roast and another for Guacamole:

recipes <- read_csv(file.path("..", "data", "food_recipes.csv"))
recipes
## # A tibble: 10 × 3
##    recipe    ingredient amount
##    <chr>     <chr>       <dbl>
##  1 Pot Roast Beef         1200
##  2 Pot Roast Carrot        400
##  3 Pot Roast Potato       1000
##  4 Pot Roast Onion         500
##  5 Pot Roast Tomato        200
##  6 Pot Roast Bay Leaf        5
##  7 Guacamole Avocado      1000
##  8 Guacamole Onion         500
##  9 Guacamole Tomato        500
## 10 Guacamole Lime          150

In this dataset, ingredient is a foreign key corresponding to the primary key “item” in the foods dataset. For the recipes data, the primary key requires two columns: both recipe and ingredient are needed to describe each row uniquely.

We can now try adding the calories information from the food dataset into the recipes dataset. The complication here is that the column we want to join on has a different name in each dataset (“ingredient” versus “item”). To specify this, specify the two different names within the by = argument of the join function:

recipes %>%
  left_join(food_sml, by = c("ingredient" = "item"))
## # A tibble: 10 × 5
##    recipe    ingredient amount food_group calories
##    <chr>     <chr>       <dbl> <chr>         <dbl>
##  1 Pot Roast Beef         1200 meat            288
##  2 Pot Roast Carrot        400 vegetable        41
##  3 Pot Roast Potato       1000 vegetable       104
##  4 Pot Roast Onion         500 vegetable        42
##  5 Pot Roast Tomato        200 vegetable        18
##  6 Pot Roast Bay Leaf        5 <NA>             NA
##  7 Guacamole Avocado      1000 fruit           160
##  8 Guacamole Onion         500 vegetable        42
##  9 Guacamole Tomato        500 vegetable        18
## 10 Guacamole Lime          150 fruit            30

Now, we have a dataset that has added the food group and calories information to our recipe dataset. Notice that the ingredient “Bay Leaf” is not in our food dataset. Because of this, the food group and calories information for this ingredient are filled in with missing values.

Notice that rows of data about food items in the dataset food that are not in any recipe are not included in the output. This is where the term “left” comes from: all rows from the “left” (e.g., first) table are returned, but non-matching rows in the second (e.g., “right”) table are dropped.

Inner and outer joins

There are four mutating joins available in dplyr. These have the same syntax. If the join key is a primary key in one table and there are no missing matches, they all function exactly the same. If the join key is a primary key in one table and not all foreign keys in the other table match an entry in the other, the following happens:

For example, by replacing left_join with the function inner_join we can modify the join from the last section to drop the missing term (Bay Leaf) that is not contained in our food dataset:

recipes %>%
  inner_join(food_sml, by = c("ingredient" = "item"))
## # A tibble: 9 × 5
##   recipe    ingredient amount food_group calories
##   <chr>     <chr>       <dbl> <chr>         <dbl>
## 1 Pot Roast Beef         1200 meat            288
## 2 Pot Roast Carrot        400 vegetable        41
## 3 Pot Roast Potato       1000 vegetable       104
## 4 Pot Roast Onion         500 vegetable        42
## 5 Pot Roast Tomato        200 vegetable        18
## 6 Guacamole Avocado      1000 fruit           160
## 7 Guacamole Onion         500 vegetable        42
## 8 Guacamole Tomato        500 vegetable        18
## 9 Guacamole Lime          150 fruit            30

These four joins are called “mutating” joins because, like the mutate function, they add columns to a dataset. There are also two “filter” joins that serve only to select rows in the first (left) table. The function semi_join selects rows in the first table that has a match in the second and anti_join selects rows in the first table that do not have a match in the second. These are useful for several data science tasks and will come up from time-to-time in our applications.

Finally, there is another join called a “nested” join (nest_join). It has some very interesting applications, but we avoid using it this semester.

Homework Question

Hand write a dataset that consists of at least two different tables where one table has a primary key that appears as foreign key on the second table. The tables do not need to be large (10 rows between both tables is more than enough) and the subject of the tables is up to you. Be creative and don’t just copy the example above.