My statistical learning course (MATH389) is in many ways a natural extension of the way that I teach Introduction to Data Science (MATH289). I recognize that many students this semester, however, may have taken 289 with a different instructor in a previous year or have come through one of the alternative prerequisites. The course should be accessible to anyone with previous experience writing code to analyze data, provided that one is prepared to put in a bit of work at the start of the semester to fill in any gaps of knowledge.
These notes provide a starting point for getting up to speed with the specific material covered in 289. They give an overview of some of the R functions we will be working with this semester. The notes assume that you have had some previous exposure to writing code, but no particular exposure to R. For example, perhaps you have used Python in an introductory CS course or Matlab in a mathematics course. Even if you have used R in other courses, it is quite possible that you have not used the specific functions and approaches mentioned here, so please review these notes carefully.
These notes cannot completely substitute for all of the detail that we go into in MATH289, Introduction to Data Science. However, they should give a good baseline from which to understand the code that we will develop throughout the course. For more information, I recommend that you check out the R for Data Science textbook and the notebooks from my Introduction to Data Science course.
I make references to both of these at several points throughout these notes as topics arise. I am always happy to answer questions about specific functions or approaches being used in the course notes.
The R programming language and all of the third-party packages that we will use during the semester are free and open-source. As a class, we will walk through the steps to install the language and needed components on your machine during the first course meeting. In the event of difficulties, we have a subscription to a cloud-based alternative that you are able to make use of. For now, I recommend just reading and understanding the code described here. There will be plenty of time to practice the material once the semester begins.
If the entire concept of writing and running scripting code seems foreign to you, I suggest reading chapters 1 and 4 of the R for Data Science book, which is freely available online.
If these chapters seem to be moving too fast, which is hopefully not the case, I recommend contacting me to assess whether 389 is an appropriate course based on your background.
Usually, the first thing we do when running R code is to load third-party extensions called packages. These provide additional functions that make working with data easier and more consistent than using built-in functions. Here are three packages that I tend to use in my data analysis work:
library(tidyverse)
library(ggrepel)
library(smodels)
Next, we typically need to set some parameters that change the default values for common functions. Here are three that I typically include; they change the way that plots are drawn, data summaries are reported, and output is printed:
theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)
options(width = 77L)
Typically in class, I will include all of the packages and default settings that you need for an analysis.
After loading the R libraries and setting up our work space, the next step is to load in some data. Here, we load a CSV file into R and store the data set as an R object named food. Notice that R uses the arrow sign (<-) to assign the output of one function to a variable name.
food <- read_csv(file.path("data", "food.csv"))
The food data set contains information about various food items, with one row for each item of food (we call these observations) and one column for each thing that we know about the food items (we call these features). Running a line of code with just the name of the data set prints out the first few rows and columns of the data. Additional feature names are given at the bottom of the output.
## # A tibble: 61 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Apple fruit            52       0.1   0.028           0      1  13.8   2.4
##  2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  3 Avoc… fruit           160      14.6    2.13           0      7  8.53   6.7
##  4 Bana… fruit            89       0.3   0.112           0      1  22.8   2.6
##  5 Chic… grains          180       2.9   0.309           0    243  30.0   8.6
##  6 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  7 Beef  meat            288      19.5    7.73          87    384     0     0
##  8 Bell… vegetable        26         0   0.059           0      2  6.03     2
##  9 Crab  fish             87         1   0.222          78    293  0.04     0
## 10 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
## # … with 51 more rows, and 8 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>
Notice that each of the columns has a unique name. For this class, we will use the convention that all column and variable names use a combination of lower-case letters, underscores, and numbers. This helps keep our code neat and easy to follow.
A common task in data analysis is manipulating an existing data set. In R, we will often do this through the use of data verbs. These are functions that take one version of a data set and return a modified version of the data set. They always work on a copy of the data. Often we want to apply a sequence of data verbs one after another. To make this easy, and avoid the need to create temporary variables, we can use the pipe operator %>%, which passes the output of one line into the first argument of the next.
It is probably more intuitive to see an example of data verbs and the pipe operator than to describe at length how they function. As an example, here is a chain of operators that first filters the data to include only those rows where the food group variable indicates a fruit and then arranges the data from the smallest to the largest value of sodium.
food %>%
  filter(food_group == "fruit") %>%
  arrange(sodium)
## # A tibble: 16 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Grap… fruit            32         0   0.014           0      0  8.08   1.1
##  2 Oran… fruit            47       0.1   0.015           0      0  11.8   2.4
##  3 Peach fruit            39       0.2   0.019           0      0  9.54   1.5
##  4 Plum  fruit            46       0.2   0.017           0      0  11.4   1.4
##  5 Apple fruit            52       0.1   0.028           0      1  13.8   2.4
##  6 Bana… fruit            89       0.3   0.112           0      1  22.8   2.6
##  7 Pear  fruit            58       0.1   0.006           0      1  15.5   3.1
##  8 Pine… fruit            48       0.1   0.009           0      1  12.6   1.4
##  9 Stra… fruit            32         0   0.015           0      1  7.68     2
## 10 Grape fruit            69       0.1   0.054           0      2  18.1   0.9
## 11 Lemon fruit            29         0   0.039           0      2  9.32   2.8
## 12 Lime  fruit            30         0   0.022           0      2  10.5   2.8
## 13 Tang… fruit            53       0.3   0.039           0      2  13.3   1.8
## 14 Kiwi  fruit            61       0.5   0.029           0      3  14.7     3
## 15 Avoc… fruit           160      14.6    2.13           0      7  8.53   6.7
## 16 Cant… fruit            34       0.1   0.051           0     16  8.16   0.9
## # … with 8 more variables: sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>
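To make the role of the pipe concrete, here is a minimal sketch on a small made-up table. The object name toy and its four rows are assumptions for illustration, not the course data. It suggests that a piped chain gives the same result as nesting the same function calls inside one another.

```r
library(dplyr)
library(tibble)

# A tiny made-up table standing in for the food data (illustrative values)
toy <- tibble(
  item = c("Kiwi", "Beef", "Lime", "Apple"),
  food_group = c("fruit", "meat", "fruit", "fruit"),
  sodium = c(3, 384, 2, 1)
)

# Piped version: the output of each line becomes the first argument of the next
piped <- toy %>%
  filter(food_group == "fruit") %>%
  arrange(sodium)

# Equivalent nested version, which has to be read from the inside out
nested <- arrange(filter(toy, food_group == "fruit"), sodium)

# Compare the two results
all.equal(piped, nested)
```

The nested form is harder to read once more than two verbs are involved, which is the main reason for preferring the pipe.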
Other common verbs include the mutate verb, which creates new variables as a function of the existing variables, and the select verb, which selects a subset of the existing columns. Here, we compute the percentage of calories that are from fat (there are 9 calories in 1 gram of fat), arrange the data in descending order by that percentage, and then select only the relevant variables.
food %>%
  mutate(calories_fat_perc = total_fat * 9 / calories * 100) %>%
  arrange(desc(calories_fat_perc)) %>%
  select(item, total_fat, calories, calories_fat_perc)
## # A tibble: 61 x 4
##    item       total_fat calories calories_fat_perc
##    <chr>          <dbl>    <dbl>             <dbl>
##  1 Sour Cream      20.9      214              87.9
##  2 Avocado         14.6      160              82.1
##  3 Cheese          26.9      350              69.2
##  4 Halibut         17.7      239              66.7
##  5 Lamb            20.7      292              63.8
##  6 Beef            19.5      288              60.9
##  7 Pork            17        271              56.5
##  8 Catfish         14.5      240              54.4
##  9 Chicken         13.4      237              50.9
## 10 Milk             3.2       60              48.
## # … with 51 more rows
Inside of the filter function you can use a number of different logical operators, such as != (not equal), %in% (set containment), & (and), and | (or). And inside of the mutate function, an array of different mathematical operators can be applied, such as the multiplication (*) and division (/) used above.
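As a rough illustration of these logical operators, here is a sketch on a small made-up table. The object name toy is hypothetical, and the values are only loosely based on the tables above; the food group given for Milk is an assumption.

```r
library(dplyr)
library(tibble)

# A tiny made-up table (illustrative values, not the full course data)
toy <- tibble(
  item = c("Apple", "Beef", "Crab", "Lime", "Milk"),
  food_group = c("fruit", "meat", "fish", "fruit", "dairy"),
  calories = c(52, 288, 87, 30, 60)
)

# != keeps rows where the value differs from the one given
not_fruit <- toy %>% filter(food_group != "fruit")

# %in% keeps rows whose value appears in a set of options
meat_or_fish <- toy %>% filter(food_group %in% c("meat", "fish"))

# & requires both conditions to hold
low_cal_fruit <- toy %>% filter(food_group == "fruit" & calories < 50)

# | requires at least one of the conditions to hold
fruit_or_high <- toy %>% filter(food_group == "fruit" | calories > 200)
```

Separating conditions with a comma inside filter behaves like &, so the third example could also be written as filter(food_group == "fruit", calories < 50).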
Another data manipulation verb that deserves special attention is the summarize command. By default, it summarizes all of the rows of a data set in a single line by applying summary functions to columns of the data. For example, here is the code to take the average (mean) value of three of the variables in the food data:
food %>%
  summarize(sm_mean(calories), sm_mean(total_fat), sm_mean(sat_fat))
## # A tibble: 1 x 3
##   calories_mean total_fat_mean sat_fat_mean
##           <dbl>          <dbl>        <dbl>
## 1          114.           4.44         1.47
Notice that only the new variables are included in the output.
The real power of the summarize function comes from grouping the data by one or more variables prior to running the summary command. The summary functions will then be applied separately within each unique value of the grouping variable(s), producing one row for each unique value. Here is the code to compute the same summaries for each food group:
food %>%
  group_by(food_group) %>%
  summarize(sm_mean(calories), sm_mean(total_fat), sm_mean(sat_fat))
## # A tibble: 6 x 4
##   food_group calories_mean total_fat_mean sat_fat_mean
##   <chr>              <dbl>          <dbl>        <dbl>
## 1 dairy              181.          13.0         8.07
## 2 fish               167.           7.22        1.47
## 3 fruit               54.9          1.04        0.162
## 4 grains             196.           2.56        0.421
## 5 meat               234.          13.9         5.12
## 6 vegetable           37.4          0.281       0.0693
There are a number of other summary functions that we will occasionally need, such as sm_cor(). It is also possible to use a format similar to the mutate function, where the summary function and new variable names are explicitly defined.
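Here is one way the explicit format might look, as a sketch on a made-up four-row table (the object name toy is hypothetical). It uses the base R functions mean() and max() and the dplyr helper n() in place of the sm_ helpers, and the new column names on the left of each equals sign are chosen only for illustration.

```r
library(dplyr)
library(tibble)

# A tiny made-up table (illustrative values)
toy <- tibble(
  food_group = c("fruit", "fruit", "meat", "meat"),
  calories = c(52, 30, 288, 292)
)

# Each summary is written as new_name = summary_function(column),
# mirroring the way new variables are defined inside mutate
toy_summary <- toy %>%
  group_by(food_group) %>%
  summarize(
    calories_mean = mean(calories),
    calories_max = max(calories),
    n_items = n()
  )

toy_summary
```

The result has one row per food group and one column per named summary, in addition to the grouping variable.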
More information about data verbs and data grouping can be found in Chapter 5 of the R for Data Science textbook and in Notebook 4, Notebook 5, and Notebook 6 of my Introduction to Data Science course.
All of the data verbs above work by taking a single data table as an input and returning a modified copy of the data as an output. The other class of data verbs that we will use allows us to combine information from two different data tables. These are called two-table verbs. To show an example of these, let’s load another data set of recipes showing the ingredients for two dishes.
recipes <- read_csv(file.path("data", "food_recipes.csv"))
recipes
## # A tibble: 10 x 3
##    recipe    ingredient amount
##    <chr>     <chr>       <dbl>
##  1 Pot Roast Beef         1200
##  2 Pot Roast Carrot        400
##  3 Pot Roast Potato       1000
##  4 Pot Roast Onion         500
##  5 Pot Roast Tomato        200
##  6 Pot Roast Bay Leaf        5
##  7 Guacamole Avocado      1000
##  8 Guacamole Onion         500
##  9 Guacamole Tomato        500
## 10 Guacamole Lime          150
We might want to combine this data with the food data to see, for example, how many calories are in each dish. We do this with the inner_join function, which allows us to combine two data sets by joining along a common key variable (here, the food name).
recipes %>%
  inner_join(food, by = c("ingredient" = "item"))
## # A tibble: 9 x 19
##   recipe ingredient amount food_group calories total_fat sat_fat cholesterol
##   <chr>  <chr>       <dbl> <chr>         <dbl>     <dbl>   <dbl>       <dbl>
## 1 Pot R… Beef         1200 meat            288      19.5    7.73          87
## 2 Pot R… Carrot        400 vegetable        41       0.2   0.037           0
## 3 Pot R… Potato       1000 vegetable       104         2   0.458           0
## 4 Pot R… Onion         500 vegetable        42         0   0.026           0
## 5 Pot R… Tomato        200 vegetable        18         0   0.046           0
## 6 Guaca… Avocado      1000 fruit           160      14.6    2.13           0
## 7 Guaca… Onion         500 vegetable        42         0   0.026           0
## 8 Guaca… Tomato        500 vegetable        18         0   0.046           0
## 9 Guaca… Lime          150 fruit            30         0   0.022           0
## # … with 11 more variables: sodium <dbl>, carbs <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>
All of the food nutritional facts are given for a 100g serving; the recipes give amounts in grams. With this knowledge, we can put together the verbs from the previous sections to compute the number of calories in each dish:
recipes %>%
  inner_join(food, by = c("ingredient" = "item")) %>%
  mutate(calories_total = (calories / 100) * amount) %>%
  group_by(recipe) %>%
  summarize(sm_sum(calories_total))
## # A tibble: 2 x 2
##   recipe    calories_total_sum
##   <chr>                  <dbl>
## 1 Guacamole               1945
## 2 Pot Roast               4906
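As a check on the arithmetic, the guacamole total can be reproduced by hand in base R. This is only a sketch: the amounts and calorie values are copied from the joined table above, and the vector names amount and calories_per_100g are chosen here for illustration.

```r
# Guacamole ingredients: amount in grams and calories per 100g serving,
# copied from the joined recipe table
amount <- c(Avocado = 1000, Onion = 500, Tomato = 500, Lime = 150)
calories_per_100g <- c(Avocado = 160, Onion = 42, Tomato = 18, Lime = 30)

# Calories contributed by each ingredient: (calories per 100g / 100) * grams
calories_total <- (calories_per_100g / 100) * amount
calories_total

# Total for the dish, which agrees with the 1945 reported for Guacamole
sum(calories_total)
```

The avocado alone contributes 1600 of the 1945 calories, which is the kind of detail that the grouped summary hides.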
Another two-table verb is left_join, which works exactly the same but includes rows that only exist in the first table. Notice the difference here, with the row containing the bay leaf (which is not present in the food data set) included in the output:
recipes %>%
  left_join(food, by = c("ingredient" = "item"))
## # A tibble: 10 x 19
##    recipe ingredient amount food_group calories total_fat sat_fat cholesterol
##    <chr>  <chr>       <dbl> <chr>         <dbl>     <dbl>   <dbl>       <dbl>
##  1 Pot R… Beef         1200 meat            288      19.5    7.73          87
##  2 Pot R… Carrot        400 vegetable        41       0.2   0.037           0
##  3 Pot R… Potato       1000 vegetable       104         2   0.458           0
##  4 Pot R… Onion         500 vegetable        42         0   0.026           0
##  5 Pot R… Tomato        200 vegetable        18         0   0.046           0
##  6 Pot R… Bay Leaf        5 <NA>             NA        NA      NA          NA
##  7 Guaca… Avocado      1000 fruit           160      14.6    2.13           0
##  8 Guaca… Onion         500 vegetable        42         0   0.026           0
##  9 Guaca… Tomato        500 vegetable        18         0   0.046           0
## 10 Guaca… Lime          150 fruit            30         0   0.022           0
## # … with 11 more variables: sodium <dbl>, carbs <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, iron <dbl>, vitamin_a <dbl>,
## #   vitamin_c <dbl>, wiki <chr>, description <chr>, color <chr>
There are also the variations right_join and full_join, which include missing keys in the second data set and missing keys in both data sets, respectively. Finally, the functions semi_join and anti_join identify the rows of the first data set that do or do not have matching values in the second data set, respectively, but do not actually perform any joining together of the data columns. These will be useful in some text analysis tasks.
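The behavior of these variations can be sketched on two tiny made-up tables. The object names recipe_toy, food_toy, and key are hypothetical, and the sketch also includes the related dplyr verbs right_join and semi_join for comparison.

```r
library(dplyr)
library(tibble)

# Two tiny made-up tables: Bay Leaf has no match in food_toy,
# and Apple has no match in recipe_toy
recipe_toy <- tibble(ingredient = c("Beef", "Bay Leaf", "Lime"))
food_toy <- tibble(
  item = c("Beef", "Lime", "Apple"),
  calories = c(288, 30, 52)
)
key <- c("ingredient" = "item")

# right_join keeps every row of the second table (Apple is kept)
right_rows <- recipe_toy %>% right_join(food_toy, by = key)

# full_join keeps every row from both tables (Bay Leaf and Apple are kept)
full_rows <- recipe_toy %>% full_join(food_toy, by = key)

# semi_join keeps rows of the first table that have a match, adding no columns
matched <- recipe_toy %>% semi_join(food_toy, by = key)

# anti_join keeps rows of the first table that have no match
unmatched <- recipe_toy %>% anti_join(food_toy, by = key)
```

An anti_join like the last one is a quick way to find ingredients, such as the bay leaf, that are missing from a lookup table before joining.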
More information about combining data sets based on common key variables can be found in Chapter 12 and Chapter 13 of the R for Data Science textbook and in Notebook 9 of my Introduction to Data Science course.
Another major task in data analysis is producing visualizations of data. For this, we will use a system called the Grammar of Graphics. It requires a bit of work to create simple plots, but can be extended in a logical way to capture almost any kind of plot you would want to make with your data.
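To start, let’s see how to draw a scatter plot of our food data. Each row of the data will be drawn as a dot, with the x-coordinate given by the sugar content of the food and the y-coordinate given by the number of calories in the food item. This requires specifying three elements in the grammar of graphics: the data set to draw from (food), the type of geometry to draw (points), and the aesthetics that map variables in the data to the x- and y-coordinates.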
The syntax for doing this in R is:
food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories))
We can specify additional aesthetics that describe the way the points are plotted by mapping these to other variables in the data. R will take care of the details for us. For example, we can specify that the color of the points should change based on the item’s food group:
food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories, color = food_group))
Notice that R has figured out what colors to use and how to map them to each unique value of the food group variable. Aesthetics can also be assigned to different fixed values as follows (note that these arguments go outside of the aes function):
food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories), color = "salmon", size = 4)
Resources for additional geometry types, aesthetics, and ways of further customizing graphics are given at the end of the following section.
In order to make more complex plots, we can layer multiple geometries together by literally adding them together with the plus sign. For example, we can add a text-repel layer to the plot that labels some of the items (the term repel indicates that each label will be placed to avoid overlapping the points and the other labels). This geometry requires specifying the label aesthetic to indicate which variable is used to provide the label.
food %>%
  ggplot() +
    geom_point(aes(x = sugar, y = calories)) +
    geom_text_repel(aes(x = sugar, y = calories, label = item))
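When several layers share the same x and y aesthetics, one alternative is to declare them once inside ggplot() so that every layer inherits them. The sketch below shows this on a small made-up data frame; the object names toy and p are hypothetical, and the sugar values are invented for illustration rather than taken from the course data.

```r
library(ggplot2)
library(ggrepel)

# A tiny made-up data set (the sugar values are invented for illustration)
toy <- data.frame(
  item = c("Apple", "Avocado", "Banana", "Beef"),
  sugar = c(10, 1, 12, 0),
  calories = c(52, 160, 89, 288)
)

# Aesthetics given inside ggplot() are inherited by every layer,
# so x and y are written once and only label is added for the text layer
p <- ggplot(toy, aes(x = sugar, y = calories)) +
  geom_point() +
  geom_text_repel(aes(label = item))

# Printing the object draws the plot
p
```

This produces the same kind of layered plot as repeating the aesthetics in each geometry; which form to use is largely a matter of taste.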