Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Relational data

A primary key consists of one or more variables that uniquely identify a row of data. In our food dataset, the variable item is a primary key. Each row has a unique value for this variable. Likewise, a foreign key is the appearance of a primary key within a different dataset. The food group variable, for example, could be a foreign key if we had another table with one row describing information about each food group.

A primary key and the corresponding foreign key in another table form a relation. Typically a relation maps a single row in one dataset to many rows in another.

A table join is a way of combining two tables based on relations. The goal is to match up a foreign key in one table with the primary key in another table, to add new variables from one dataset into another dataset. Several examples will make this notion more clear.

Joining data by key

As an example of performing table joins, we will start with a paired down version of our foods data:

food_sml <- food %>%
  select(item, food_group, calories)

food_sml
## # A tibble: 61 x 3
##    item        food_group calories
##    <chr>       <chr>         <dbl>
##  1 Apple       fruit            52
##  2 Asparagus   vegetable        20
##  3 Avocado     fruit           160
##  4 Banana      fruit            89
##  5 Chickpea    grains          180
##  6 String Bean vegetable        31
##  7 Beef        meat            288
##  8 Bell Pepper vegetable        26
##  9 Crab        fish             87
## 10 Broccoli    vegetable        34
## # … with 51 more rows

Now, consider the following new dataset corresponding to dietary restrictions associated with different food groups:

diet <- read_csv(file.path("data", "food_diet_restrictions.csv"))
diet
## # A tibble: 6 x 4
##   food_group vegan vegetarian pescatarian
##   <chr>      <chr> <chr>      <chr>
## 1 fruit      yes   yes        yes
## 2 vegetable  yes   yes        yes
## 3 grains     yes   yes        yes
## 4 meat       no    no         no
## 5 fish       no    no         yes
## 6 dairy      no    yes        yes

In the diet table, the food group variable is a primary key. Within the food table, it is a foreign key. What we would like to do is to combine these datasets by matching up rows that have the same values in the corresponding columns.

For example, it could be helpful to include the columns vegan, vegetarian, and pescatarian in the foods dataset. In order to do this, we need to associate a row in the food dataset with the row in the diet dataset that contains the same value of the variable food_group. We do this using the function left_join; we pipe in the larger dataset that we want to join columns to, provide the name of the table to grab columns from, and indicate what column will be used to combine the two datasets.

food_sml %>%
  left_join(diet, by = c("food_group"))
## # A tibble: 61 x 6
##    item        food_group calories vegan vegetarian pescatarian
##    <chr>       <chr>         <dbl> <chr> <chr>      <chr>
##  1 Apple       fruit            52 yes   yes        yes
##  2 Asparagus   vegetable        20 yes   yes        yes
##  3 Avocado     fruit           160 yes   yes        yes
##  4 Banana      fruit            89 yes   yes        yes
##  5 Chickpea    grains          180 yes   yes        yes
##  6 String Bean vegetable        31 yes   yes        yes
##  7 Beef        meat            288 no    no         no
##  8 Bell Pepper vegetable        26 yes   yes        yes
##  9 Crab        fish             87 no    no         yes
## 10 Broccoli    vegetable        34 yes   yes        yes
## # … with 51 more rows

And that is it! We now have the dietary restrictions added into the food dataset. We can now compute models, summaries, and construct visualizations based on the new metadata associated with each food group.

Now, let’s consider a third dataset containing information about food recipes, one for a Pot Roast and another for Guacamole:

recipes <- read_csv(file.path("data", "food_recipes.csv"))
recipes
## # A tibble: 10 x 3
##    recipe    ingredient amount
##    <chr>     <chr>       <dbl>
##  1 Pot Roast Beef         1200
##  2 Pot Roast Carrot        400
##  3 Pot Roast Potato       1000
##  4 Pot Roast Onion         500
##  5 Pot Roast Tomato        200
##  6 Pot Roast Bay Leaf        5
##  7 Guacamole Avocado      1000
##  8 Guacamole Onion         500
##  9 Guacamole Tomato        500
## 10 Guacamole Lime          150

In this dataset, ingredient is a foreign key corresponding to the primary key “item” in the foods dataset. For the recipes data, the primary key requires two columns: both recipe and ingredient are needed to describe each row uniquely.

We can now try adding the calories information from the food dataset into the recipes dataset. The complication here is that the column we want to join on has a different name in each dataset (“ingredient” versus “item”). To specify this, specify the two different names within the by = argument of the join function:

recipes %>%
  left_join(food_sml, by = c("ingredient" = "item"))
## # A tibble: 10 x 5
##    recipe    ingredient amount food_group calories
##    <chr>     <chr>       <dbl> <chr>         <dbl>
##  1 Pot Roast Beef         1200 meat            288
##  2 Pot Roast Carrot        400 vegetable        41
##  3 Pot Roast Potato       1000 vegetable       104
##  4 Pot Roast Onion         500 vegetable        42
##  5 Pot Roast Tomato        200 vegetable        18
##  6 Pot Roast Bay Leaf        5 <NA>             NA
##  7 Guacamole Avocado      1000 fruit           160
##  8 Guacamole Onion         500 vegetable        42
##  9 Guacamole Tomato        500 vegetable        18
## 10 Guacamole Lime          150 fruit            30

Now, we have a dataset that has added the food group and calories information to our recipe dataset. Notice that the ingredient “Bay Leaf” is not in our food dataset. Because of this, the food group and calories information for this ingredient are filled in with missing values.

Notice that rows of data about food items in the dataset food that are not in any recipe are not included in the output. This is where the term “left” comes from: all rows from the “left” (e.g., first) table are returned, but non-matching rows in the second (e.g., “right”) table are dropped.

Inner and outer joins

There are four mutating joins available in dplyr. These have the same syntax. If the join key is a primary key in one table and there are no missing matches, they all function exactly the same. If the join key is a primary key in one table and not all foreign keys in the other table match an entry in the other, the following happens:

  • left_join only non-matching rows in the first dataset are kept
  • right_join only non-matching rows in the second dataset are kept
  • full_join non-matching rows are included from either table
  • inner_join only matching rows from both tables are included

For example, by replacing left_join with the function inner_join we can modify the join from the last section to drop the missing term (Bay Leaf) that is not contained in our food dataset:

recipes %>%
  inner_join(food_sml, by = c("ingredient" = "item"))
## # A tibble: 9 x 5
##   recipe    ingredient amount food_group calories
##   <chr>     <chr>       <dbl> <chr>         <dbl>
## 1 Pot Roast Beef         1200 meat            288
## 2 Pot Roast Carrot        400 vegetable        41
## 3 Pot Roast Potato       1000 vegetable       104
## 4 Pot Roast Onion         500 vegetable        42
## 5 Pot Roast Tomato        200 vegetable        18
## 6 Guacamole Avocado      1000 fruit           160
## 7 Guacamole Onion         500 vegetable        42
## 8 Guacamole Tomato        500 vegetable        18
## 9 Guacamole Lime          150 fruit            30

These four joins are called “mutating” joins because, like the mutate function, they add columns to a dataset. There are also two “filter” joins that serve only to select rows in the first (left) table. The function semi_join selects rows in the first table that has a match in the second and anti_join selects rows in the first table that do not have a match in the second. These are useful for several data science tasks and will come up from time-to-time in our applications.

Finally, there is another join called a “nested” join (nest_join). We may use this when doing web scraping towards the end of the semester, but we likely will avoid it until then as it is a bit more complicated.

Practice

Largest City and Köppen Datasets

In this notebook we will work with the largest cities dataset:

cities <- read_csv(file.path("data", "largest_cities.csv"))

We will also load a dataset containing metadata about the Köppen climate classification scores.

koppen <- read_csv(file.path("data", "koppen_metadata.csv"))
koppen
## # A tibble: 31 x 3
##    code  name                                                     color
##    <chr> <chr>                                                    <chr>
##  1 Af    Tropical rainforest climate                              #0000FE
##  2 Am    Tropical monsoon climate                                 #0077FF
##  3 Aw    Tropical savanna climate with dry-winter characteristics #46A9FA
##  4 As    Tropical savanna climate with dry-summer characteristics #46A9FA
##  5 BWh   Hot desert climates                                      #FE0000
##  6 BWk   Cold desert climates                                     #FE9695
##  7 BSh   Hot semi-arid climates                                   #F5A301
##  8 BSk   Cold semi-arid climates                                  #FFDB63
##  9 Csa   Mediterranean hot summer climates                        #FFFF00
## 10 Csb   Mediterranean warm/cool summer climates                  #C6C700
## # … with 21 more rows

Joining Climate Data

In the code block below, I am using the left_join function to combine the cities dataset with the koppen metadata. Note that the key has a different name in each of the two datasets.

cities %>%
  left_join(koppen, by = c("koppen_code" = "code"))
## # A tibble: 81 x 28
##    name.x country city_definition population city_pop city_area metro_pop
##    <chr>  <chr>   <chr>                <dbl>    <dbl>     <dbl>     <dbl>
##  1 Tokyo  Japan   Metropolis pre…       37.4    13.5       2191      37.3
##  2 Delhi  India   National capit…       28.5    16.8       1484      29
##  3 Shang… China   Municipality          25.6    24.2       6341      NA
##  4 São P… Brazil  Municipality          21.6    12.3       1521      21.7
##  5 Mexic… Mexico  City-state            21.6     8.92      1485      20.9
##  6 Cairo  Egypt   Urban governor…       20.1     9.5       3085      NA
##  7 Mumbai India   Municipality          20.0    12.5        603      24.4
##  8 Beiji… China   Municipality          19.6    21.7      16411      NA
##  9 Dhaka  Bangla… Capital city          19.6    14.4        338      14.5
## 10 Osaka  Japan   Designated city       19.3     2.72       225      19.3
## # … with 71 more rows, and 21 more variables: metro_area <dbl>,
## #   urban_pop <dbl>, urban_area <dbl>, wiki <chr>, country_code2 <chr>,
## #   country_code3 <chr>, country_name_official <chr>, continent <chr>,
## #   lon <dbl>, lat <dbl>, koppen_code <chr>, koppen_main <chr>, city <chr>,
## #   num <dbl>, cost_of_living <dbl>, cost_rent <dbl>, cost_groceries <dbl>,
## #   cost_restaurant <dbl>, local_pp <dbl>, name.y <chr>, color <chr>

Take a close look at the dataset you created in the last question. Scrolling all the way to the final columns, you should see the name of the climate region and the default color added into the dataset.

Something, however, has happened to the first variable in the combined dataset. What is the (new) name of the first variable in the combined output? Answer:

It turns out that both the cities dataset and the koppen dataset both contain a variable named “name”. In R, a dataset cannot contain two columns with the same names. By default, left_join adds a suffix to distinguish variables that came from the first dataset and those that came from the second dataset. I generally find the default suffix values to not be ideal because they change both the names in the left table (the one that contains the observations of interest) as well as the the new metadata columns. Let’s see how to change these suffix values.

The left_join function has an optional argument called suffix. You can supply two string values to it to describe what suffix values are placed on the first and second tables in the join. The default value is c(".x", ".y"). I typically find something with the first entry empty to be more practical. So, for example: c("", "_koppen"). In the code below, repeat the left join with this suffix option:

cities %>%
  left_join(koppen, by = c("koppen_code" = "code"), suffix = c("", "_koppen"))
## # A tibble: 81 x 28
##    name  country city_definition population city_pop city_area metro_pop
##    <chr> <chr>   <chr>                <dbl>    <dbl>     <dbl>     <dbl>
##  1 Tokyo Japan   Metropolis pre…       37.4    13.5       2191      37.3
##  2 Delhi India   National capit…       28.5    16.8       1484      29
##  3 Shan… China   Municipality          25.6    24.2       6341      NA
##  4 São … Brazil  Municipality          21.6    12.3       1521      21.7
##  5 Mexi… Mexico  City-state            21.6     8.92      1485      20.9
##  6 Cairo Egypt   Urban governor…       20.1     9.5       3085      NA
##  7 Mumb… India   Municipality          20.0    12.5        603      24.4
##  8 Beij… China   Municipality          19.6    21.7      16411      NA
##  9 Dhaka Bangla… Capital city          19.6    14.4        338      14.5
## 10 Osaka Japan   Designated city       19.3     2.72       225      19.3
## # … with 71 more rows, and 21 more variables: metro_area <dbl>,
## #   urban_pop <dbl>, urban_area <dbl>, wiki <chr>, country_code2 <chr>,
## #   country_code3 <chr>, country_name_official <chr>, continent <chr>,
## #   lon <dbl>, lat <dbl>, koppen_code <chr>, koppen_main <chr>, city <chr>,
## #   num <dbl>, cost_of_living <dbl>, cost_rent <dbl>, cost_groceries <dbl>,
## #   cost_restaurant <dbl>, local_pp <dbl>, name_koppen <chr>, color <chr>

Verify that the variable names match what you expected to find in the new dataset. The same suffix option can be used to modify the output for other table joins.

Below, pipe the joined dataset from the last question into a plot with longitude on the x-axis and latitude on the y-axis. Color the points based on the default Köppen codes using an identity color scale:

cities %>%
  left_join(koppen, by = c("koppen_code" = "code"), suffix = c("", "_koppen")) %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = color)) +
    scale_color_identity()

Creating join table

So far we have worked only with data loaded into R through csv or Excel files. It is also possible to directly create datasets within R code. This can be very useful for quickly creating small metadata tables, particularly as a way of cleaning and labelling data. In order to make a new dataset directly in R we use the tibble function, describing each variable with names followed by equals signs and the values contained within the function c(). For example, here is a dataset with a column giving the names of the five continents in our dataset next to the names of the continents in Georgian:

continent_names <- tibble(
  continent = c("Asia", "South America", "North America", "Africa", "Europe"),
  continent_ka = c("აზია", "სამხრეთი ამერიკა", "ჩრდილოეთი ამერიკა", "აფრიკა", "ევროპა")
)

continent_names
## # A tibble: 5 x 2
##   continent     continent_ka
##   <chr>         <chr>
## 1 Asia          აზია
## 2 South America სამხრეთი ამერიკა
## 3 North America ჩრდილოეთი ამერიკა
## 4 Africa        აფრიკა
## 5 Europe        ევროპა

In the code below, join the cities dataset with the continent_ka dataset. Pipe the data into a plot with longitude on the x-axis and latitude on the y-axis, coloring the points with the variable continent_ka.

cities %>%
  left_join(continent_names, by = "continent") %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = continent_ka))

The plot should now show the legend in Georgian. In the code below, repeat the procedure with a different language of your choice and recreate the plot with the new labels.

continent_names <- tibble(
  continent = c("Asia", "South America", "North America", "Africa", "Europe"),
  continent_fr = c("Asie", "Amérique du Sud", "Amérique du Nord", "Afrique", "Europe")
)

cities %>%
  left_join(continent_names, by = "continent") %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = continent_fr))

Note that these plots may not work locally on your own machine, depending on the character set you choose. They should, however, run on the Linux server running RStudio Cloud.

Manual colors with a legend

There is one problem with the plot that we created to color the cities by their Köppen codes. There’s no legend for the plot. A way to rectify this problem is to color the points using the variable name_koppen and use the colors in the koppen metadata table to describe a manual scale with scale_color_manual. Specifically, we need:

scale_color_manual(values = koppen$color, labels = koppen$name).

Below, redo your plot with this color value and scale.

cities %>%
  left_join(koppen, by = c("koppen_code" = "code"), suffix = c("", "_koppen")) %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = color)) +
    scale_color_manual(values = koppen$color, labels = koppen$name)