Introduction

My statistical learning course (MATH389) is in many ways a natural extension of the way that I teach Introduction to Data Science (MATH289). I recognize, however, that many students this semester may have taken 289 with a different instructor in a previous year or have come through one of the alternative prerequisites. The course should be accessible to anyone with previous experience writing code to analyze data, provided that you are prepared to put in a bit of work at the start of the semester to fill in any gaps in knowledge.

These notes provide a starting point for getting up to speed with the specific material covered in 289. They give an overview of some of the R functions we will be working with this semester. The notes assume that you have had some previous exposure to writing code, but no particular exposure to R. For example, perhaps you have used Python in an introductory CS course or Matlab in a mathematics course. Even if you have used R in other courses, it is quite possible that you have not used the specific functions and approaches mentioned here, so please review these notes carefully.

These notes cannot completely substitute for all of the detail that we go into in MATH289. However, they should give a good baseline from which to understand the code that we will develop throughout the course. For more information, I recommend the R for Data Science textbook and the notebooks from my Introduction to Data Science course.

I make references to both of these at several points throughout these notes as topics arise. I am always happy to answer questions about specific functions or approaches being used in the course notes.

Installation

The R programming language and all of the third-party packages that we will use during the semester are free and open-source. As a class, we will walk through the steps to install the language and needed components on your machine during the first course meeting. In the event of difficulties, we have a subscription to a cloud-based alternative that you are able to make use of. For now, I recommend just reading and understanding the code described here. There will be plenty of time to practice the material once the semester begins.

If the entire concept of writing and running scripting code seems foreign to you, I suggest reading chapters 1 and 4 of the R for Data Science book, which is freely available online.

In the case that these chapters seem to be moving too fast, which is hopefully not the case, I recommend contacting me to assess whether 389 is an appropriate course based on your background.

Setup

Usually, the first thing we do when running R code is to load third-party extensions called packages. These provide additional functions that make working with data easier and more consistent than using built-in functions. Here are three packages that I tend to use in my data analysis work:

library(tidyverse)
library(ggrepel)
library(smodels)

Next, we typically need to set some parameters that change the default values for common functions. Here are three that I typically include; they change the way that plots, data summaries, and printed output behave:

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)

Typically in class, I will include all of the packages and default settings that you need for an analysis.

Loading and Viewing Datasets

After loading the R libraries and setting up our work space, the next step is to load in some data. Here, we load a csv file into R and store the data set as an R object named food. Notice that R uses the arrow sign (<-) to assign the output of one function to a variable name.

food <- read_csv(file.path("data", "food.csv"))

The food data set contains information about various food items, with one row for each item of food (we call these observations) and one column for each thing that we know about the food items (we call these features). Running a line of code with just the name of the data set prints out the first few rows and columns of the data. Additional feature names are given at the bottom of the output.

food
## # A tibble: 61 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
##  2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  3 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
##  4 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
##  5 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
##  6 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  7 Beef  meat            288      19.5   7.73           87    384  0      0  
##  8 Bell… vegetable        26       0     0.059           0      2  6.03   2  
##  9 Crab  fish             87       1     0.222          78    293  0.04   0  
## 10 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
## # … with 51 more rows, and 8 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>

Notice that each of the columns has a unique name. For this class, we will use the convention that all column and variable names use a combination of lower-case letters, underscores, and numbers. This helps keep our code neat and easy to follow.

Manipulating Data

Data Verbs

A common task in data analysis is manipulating an existing data set. In R, we will often do this through the use of data verbs. These are functions that take one version of a data set and return a modified version of the data set. They always work on a copy of the data. Often we want to apply a sequence of data verbs one after another. To make this easy, and avoid the need to create temporary variables, we can use the pipe operator %>%, which passes the output of one line into the first argument of the next.

It is probably more intuitive to see an example of data verbs and the pipe operator than to describe at length how they function. As an example, here is a chain of operations that first filters the data to include only those rows where the food group variable indicates a fruit and then arranges the data from the smallest to the largest value of sodium.

food %>%
  filter(food_group == "fruit") %>%
  arrange(sodium)
## # A tibble: 16 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Grap… fruit            32       0     0.014           0      0  8.08   1.1
##  2 Oran… fruit            47       0.1   0.015           0      0 11.8    2.4
##  3 Peach fruit            39       0.2   0.019           0      0  9.54   1.5
##  4 Plum  fruit            46       0.2   0.017           0      0 11.4    1.4
##  5 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
##  6 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
##  7 Pear  fruit            58       0.1   0.006           0      1 15.5    3.1
##  8 Pine… fruit            48       0.1   0.009           0      1 12.6    1.4
##  9 Stra… fruit            32       0     0.015           0      1  7.68   2  
## 10 Grape fruit            69       0.1   0.054           0      2 18.1    0.9
## 11 Lemon fruit            29       0     0.039           0      2  9.32   2.8
## 12 Lime  fruit            30       0     0.022           0      2 10.5    2.8
## 13 Tang… fruit            53       0.3   0.039           0      2 13.3    1.8
## 14 Kiwi  fruit            61       0.5   0.029           0      3 14.7    3  
## 15 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
## 16 Cant… fruit            34       0.1   0.051           0     16  8.16   0.9
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>
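To make the role of the pipe concrete, here is a minimal sketch comparing a piped chain with the equivalent nested function calls. The small toy table is an assumption made up for illustration (it is not the course's food.csv), and I load only dplyr rather than the full tidyverse to keep the example light.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the course's food.csv
toy <- tibble(
  item = c("Apple", "Beef", "Lime", "Kiwi"),
  food_group = c("fruit", "meat", "fruit", "fruit"),
  sodium = c(1, 384, 2, 3)
)

# Piped version: each line feeds its output into the first argument of the next
piped <- toy %>%
  filter(food_group == "fruit") %>%
  arrange(desc(sodium))

# Nested version: the same two calls, but read from the inside out
nested <- arrange(filter(toy, food_group == "fruit"), desc(sodium))

# Both approaches should produce the same table
identical(piped, nested)
```

Notice also that toy itself is unchanged after running this code, since the verbs work on a copy of the data.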

Other common verbs include the mutate verb, which creates new variables as a function of the existing variables, and the select verb, which selects a subset of the existing columns. Here, we compute the percentage of calories that are from fat (there are 9 calories in 1 gram of fat), arrange the data in descending order by that percentage, and then select only the relevant variables.

food %>%
  mutate(calories_fat_perc = total_fat * 9 / calories * 100) %>%
  arrange(desc(calories_fat_perc)) %>%
  select(item, total_fat, calories, calories_fat_perc)
## # A tibble: 61 x 4
##    item       total_fat calories calories_fat_perc
##    <chr>          <dbl>    <dbl>             <dbl>
##  1 Sour Cream      20.9      214              87.9
##  2 Avocado         14.6      160              82.1
##  3 Cheese          26.9      350              69.2
##  4 Halibut         17.7      239              66.7
##  5 Lamb            20.7      292              63.8
##  6 Beef            19.5      288              60.9
##  7 Pork            17        271              56.5
##  8 Catfish         14.5      240              54.4
##  9 Chicken         13.4      237              50.9
## 10 Milk             3.2       60              48. 
## # … with 51 more rows

Inside of the filter function you can use a number of different comparison and logical operators, such as >, <, >=, <=, != (not equal), %in% (set containment), & (and), and | (or). Inside of the mutate function, an array of different mathematical functions can be applied, such as sqrt(), sin(), and abs().
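As a rough sketch of how a few of these operators behave, the following code applies them to a small made-up table. The table, its values, and the new column names are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the course's food.csv
toy <- tibble(
  item = c("Apple", "Beef", "Crab", "Lime", "Milk"),
  food_group = c("fruit", "meat", "fish", "fruit", "dairy"),
  calories = c(52, 288, 87, 30, 60),
  sodium = c(1, 384, 293, 2, 43)
)

# %in% keeps rows whose value appears in the given set
meat_or_fish <- toy %>%
  filter(food_group %in% c("meat", "fish"))

# & requires both conditions to hold
big_fruit <- toy %>%
  filter(food_group == "fruit" & calories > 40)

# | requires at least one of the conditions to hold
either <- toy %>%
  filter(calories >= 100 | sodium != 1)

# mutate applies functions such as sqrt() and abs() to entire columns
with_math <- toy %>%
  mutate(calories_sqrt = sqrt(calories), calories_gap = abs(calories - 100))
```

Printing each of the four objects is a good way to check that the rows kept match what you expect from the conditions.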

Summarizing Data

Another data manipulation verb that deserves special attention is the summarize command. By default, it summarizes all of the rows of a data set in a single line by applying summary functions to columns of the data. For example, here is the code to take the average (mean) value of three of the variables in the food data:

food %>%
  summarize(sm_mean(calories), sm_mean(total_fat), sm_mean(sat_fat))
## # A tibble: 1 x 3
##   calories_mean total_fat_mean sat_fat_mean
##           <dbl>          <dbl>        <dbl>
## 1          114.           4.44         1.47

Notice that only the new variables are included in the output.

The real power of the summarize function comes from grouping the data by one or more variables prior to running the summarize command. The summary functions will then be applied only within each unique value of the grouping variable(s), with one row for each unique value. Here is the code to compute the same summaries for each food group:

food %>%
  group_by(food_group) %>%
  summarize(sm_mean(calories), sm_mean(total_fat), sm_mean(sat_fat))
## # A tibble: 6 x 4
##   food_group calories_mean total_fat_mean sat_fat_mean
##   <chr>              <dbl>          <dbl>        <dbl>
## 1 dairy              181.          13.0         8.07  
## 2 fish               167.           7.22        1.47  
## 3 fruit               54.9          1.04        0.162 
## 4 grains             196.           2.56        0.421 
## 5 meat               234.          13.9         5.12  
## 6 vegetable           37.4          0.281       0.0693

There are a number of other summary functions that we will occasionally need, such as sm_min(), sm_median(), sm_quantiles(), sm_count(), and sm_cor(). It is also possible to use a format similar to the mutate function, where the summary function and new variable names are explicitly defined.
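Here is a minimal sketch of that explicit format, in which each new variable is named on the left and defined by a summary function on the right. It assumes the base R functions mean() and max() and the dplyr helper n() in place of the sm_ functions, and uses a small toy table with a few calorie values copied from the food data.

```r
library(dplyr)

# Toy table for illustration, using four calorie values from the food data
toy <- tibble(
  item = c("Apple", "Lime", "Beef", "Lamb"),
  food_group = c("fruit", "fruit", "meat", "meat"),
  calories = c(52, 30, 288, 292)
)

# Name each summary explicitly, in the same style as mutate
by_group <- toy %>%
  group_by(food_group) %>%
  summarize(
    calories_mean = mean(calories),
    calories_max = max(calories),
    n_items = n()
  )
by_group
```

The result has one row per food group and only the grouping variable plus the three new columns.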

More information about data verbs and data grouping can be found in Chapter 5 of the R for Data Science textbook and in Notebook 4, Notebook 5, and Notebook 6 of my Introduction to Data Science course.

Combining Data: Two-table Verbs

All of the data verbs above work by taking a single data table as an input and returning a modified copy of the data as an output. The other class of data verbs that we will use allow us to combine information from two different data tables. These are called two-table verbs. To show an example of these, let’s load another dataset of recipes showing the ingredients for two dishes.

recipes <- read_csv(file.path("data", "food_recipes.csv"))
recipes
## # A tibble: 10 x 3
##    recipe    ingredient amount
##    <chr>     <chr>       <dbl>
##  1 Pot Roast Beef         1200
##  2 Pot Roast Carrot        400
##  3 Pot Roast Potato       1000
##  4 Pot Roast Onion         500
##  5 Pot Roast Tomato        200
##  6 Pot Roast Bay Leaf        5
##  7 Guacamole Avocado      1000
##  8 Guacamole Onion         500
##  9 Guacamole Tomato        500
## 10 Guacamole Lime          150

We might want to combine this data with the food data to see, for example, how many calories are in each dish. We do this with the inner_join function, which allows us to combine two data sets by joining along a common key variable (here, the food name).

recipes %>%
  inner_join(food, by = c("ingredient" = "item"))
## # A tibble: 9 x 19
##   recipe ingredient amount food_group calories total_fat sat_fat cholesterol
##   <chr>  <chr>       <dbl> <chr>         <dbl>     <dbl>   <dbl>       <dbl>
## 1 Pot R… Beef         1200 meat            288      19.5   7.73           87
## 2 Pot R… Carrot        400 vegetable        41       0.2   0.037           0
## 3 Pot R… Potato       1000 vegetable       104       2     0.458           0
## 4 Pot R… Onion         500 vegetable        42       0     0.026           0
## 5 Pot R… Tomato        200 vegetable        18       0     0.046           0
## 6 Guaca… Avocado      1000 fruit           160      14.6   2.13            0
## 7 Guaca… Onion         500 vegetable        42       0     0.026           0
## 8 Guaca… Tomato        500 vegetable        18       0     0.046           0
## 9 Guaca… Lime          150 fruit            30       0     0.022           0
## # … with 11 more variables: sodium <dbl>, carbs <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>

All of the food nutritional facts are given for a 100g serving; the recipes give amounts in grams. With this knowledge, we can put together the verbs from the previous sections to compute the number of calories in each dish:

recipes %>%
  inner_join(food, by = c("ingredient" = "item")) %>%
  mutate(calories_total = (calories / 100) * amount) %>%
  group_by(recipe) %>%
  summarize(sm_sum(calories_total))
## # A tibble: 2 x 2
##   recipe    calories_total_sum
##   <chr>                  <dbl>
## 1 Guacamole               1945
## 2 Pot Roast               4906
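As a sanity check on the arithmetic, here is a short sketch that redoes the guacamole calculation by hand in base R, using the calories and amounts shown in the joined table above. The vector names are my own and are only for illustration.

```r
# Guacamole rows from the joined table: calories per 100g and amount in grams
calories_per_100g <- c(Avocado = 160, Onion = 42, Tomato = 18, Lime = 30)
amount_g <- c(Avocado = 1000, Onion = 500, Tomato = 500, Lime = 150)

# Same formula as in the mutate step: scale to one gram, then multiply by amount
calories_total <- (calories_per_100g / 100) * amount_g
calories_total

# Same step as the grouped summarize: add up the ingredients
sum(calories_total)
```

The total agrees with the value of 1945 reported for Guacamole in the summarized output.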

Another two-table verb is left_join, which works in the same way but also keeps rows that exist only in the first table, filling the columns from the second table with missing values. Notice the difference here, with the row containing the bay leaf (which is not present in the food dataset) included in the output:

recipes %>%
  left_join(food, by = c("ingredient" = "item"))
## # A tibble: 10 x 19
##    recipe ingredient amount food_group calories total_fat sat_fat cholesterol
##    <chr>  <chr>       <dbl> <chr>         <dbl>     <dbl>   <dbl>       <dbl>
##  1 Pot R… Beef         1200 meat            288      19.5   7.73           87
##  2 Pot R… Carrot        400 vegetable        41       0.2   0.037           0
##  3 Pot R… Potato       1000 vegetable       104       2     0.458           0
##  4 Pot R… Onion         500 vegetable        42       0     0.026           0
##  5 Pot R… Tomato        200 vegetable        18       0     0.046           0
##  6 Pot R… Bay Leaf        5 <NA>             NA      NA    NA              NA
##  7 Guaca… Avocado      1000 fruit           160      14.6   2.13            0
##  8 Guaca… Onion         500 vegetable        42       0     0.026           0
##  9 Guaca… Tomato        500 vegetable        18       0     0.046           0
## 10 Guaca… Lime          150 fruit            30       0     0.022           0
## # … with 11 more variables: sodium <dbl>, carbs <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>

There are also the variations right_join and full_join, which keep unmatched rows from the second data set and from both data sets, respectively. Finally, the functions semi_join and anti_join keep the rows of the first data set that do (semi_join) or do not (anti_join) have a match in the second, without adding any columns from the second data set. These will be useful in some text analysis tasks.
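To see how these four variations differ, here is a minimal sketch on two tiny tables. Both tables are toy data assumed for illustration: the first has an ingredient with no match in the second (Bay Leaf), and the second has an item with no match in the first (Apple).

```r
library(dplyr)

# Hypothetical toy tables for illustration
recipe_toy <- tibble(
  ingredient = c("Beef", "Onion", "Bay Leaf"),
  amount = c(1200, 500, 5)
)
food_toy <- tibble(
  item = c("Beef", "Onion", "Apple"),
  calories = c(288, 42, 52)
)
by_key <- c("ingredient" = "item")

# Keeps every row of the second table; Apple appears with a missing amount
right <- right_join(recipe_toy, food_toy, by = by_key)

# Keeps every row of both tables; Bay Leaf and Apple both appear
full <- full_join(recipe_toy, food_toy, by = by_key)

# Keeps rows of the first table with a match; no columns are added
semi <- semi_join(recipe_toy, food_toy, by = by_key)

# Keeps rows of the first table without a match
anti <- anti_join(recipe_toy, food_toy, by = by_key)
```

The anti_join result is a quick way to find keys that would silently drop out of an inner_join, such as the bay leaf in the recipes data.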

More information about combining data sets based on common key variables can be found in Chapter 12 and Chapter 13 of the R for Data Science textbook and in Notebook 9 of my Introduction to Data Science course.

Visualization

Scatterplots

Another major task in data analysis is producing visualizations of data. For this, we will use a system called the Grammar of Graphics. It requires a bit of work to create simple plots, but can be extended in a logical way to capture almost any kind of plot you would want to make with your data.

To start, let’s see how to draw a scatter plot of our food data. Each row of the data will be drawn as a dot, with the x-coordinate given by the sugar content of the food and the y-coordinate given by the number of calories in the food item. This requires specifying three elements in the grammar of graphics:

  • the data set (food)
  • the type of geometry (points: geom_point)
  • the x and y aesthetics (x: sugar, y: calories)

The syntax for doing this in R is:

food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories))

We can specify additional aesthetics that describe the way the points are plotted by mapping these to other variables in the data. R will take care of the details for us. For example, we can specify that the color of the points should change based on the item’s food group:

food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories, color = food_group))

Notice that R has figured out what colors to use and how to map them to each unique value of the food group variable. Aesthetics can also be assigned to different fixed values as follows (note that these arguments go outside of the aes() function):

food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories), color = "salmon", size = 4)
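The distinction between mapped and fixed aesthetics is easy to mix up, so here is a small side-by-side sketch. The toy data frame and its values are assumptions for illustration, and the plots are stored in the hypothetical objects p_mapped and p_fixed so that they can be printed one at a time.

```r
library(ggplot2)

# Hypothetical toy data for illustration; values are made up
toy <- data.frame(
  sugar = c(10.4, 0, 1.7, 9),
  calories = c(52, 288, 30, 61),
  food_group = c("fruit", "meat", "fruit", "fruit")
)

# Mapped aesthetic: color varies with a column, so it goes inside aes()
p_mapped <- ggplot(toy) +
  geom_point(aes(x = sugar, y = calories, color = food_group))

# Fixed aesthetic: every point gets the same value, so it goes outside aes()
p_fixed <- ggplot(toy) +
  geom_point(aes(x = sugar, y = calories), color = "salmon", size = 4)

# Print either object to draw the plot
p_fixed
```

A common mistake is to write color = "salmon" inside of aes(). In that case R treats "salmon" as a data value with a single category, so the points are drawn in a default color with a legend rather than in salmon.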

Resources for additional geometry types, aesthetics, and ways of further customizing graphics are given at the end of the following section.

Layering Graphics

In order to make more complex plots, we can layer multiple geometries together by literally adding them together with the plus sign. For example, we can add a text-repel layer to the plot that labels some of the items (the term repel indicates that each label will be placed to avoid overlapping the points and the other labels). This geometry requires specifying the label aesthetic to indicate which variable is used to provide the label.

food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories)) +
    geom_text_repel(aes(x = sugar, y = calories, label = item))