01. Working with R and RMarkdown

Running RMarkdown in RStudio

All of the code that we will work through this semester will be stored as RMarkdown files, which have a .Rmd extension. These files are great because they allow us to mix code and descriptions within the same file. Reading these notes will give you a brief overview of how this works; we will practice hands-on in class.

When opening an RMarkdown file in RStudio, you should see a window similar to this (it will be slightly different on Windows and depending on your screen size):

On the left is the actual file itself. Some output and other helpful bits of information are shown on the right. There is also a Console window, which we generally will not need. I have minimized it in the graphic.

Notice that the file has parts that are on a white background and other parts that are on a grey background. The white parts correspond to text and the grey parts to code. In order to run the code, and to see the output, click on the green play button on the right of each block.

When you run code to read or create a new dataset, the data will be listed in the Environment tab in the upper right hand side of RStudio:

Clicking on the data will open a spreadsheet version of the data that you can view to understand the structure of your data and to see all of the columns that are available for analysis:

Going back to the RMarkdown file by clicking on the tab on the upper row, we can see how graphics work in R. We have written some code to produce a scatter plot. When the code is run, the plot displays inside of the markdown file:

Make sure to save the notebook frequently. However, notice that only the text and code are saved. The results (plots, tables, and other output) are not automatically stored. This is actually helpful because the code is much smaller than the results, which keeps the file sizes small. If you would like to save the results in a way that can be shared with others, you need to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of the notebook. After running all the code from scratch, it will produce an HTML version of your script that you can open in a web browser:

In fact, the notes that you are currently reading were created with RMarkdown files that are knitted to HTML.

Running R Code

We now want to give a very brief overview of how to run R code. From here on, we will show only snippets of R code and their output rather than a screenshot of the entire RStudio session. You should think of each snippet as occurring inside one of the grey boxes in an RMarkdown file.

In one of its most basic forms, R can be used as a fancy calculator. For example, we can divide 12 by 4:

12 / 4

## [1] 3

We can also store values by creating new objects within R. To do this, use the <- (arrow) symbol. For example, we can create a new object called mynum with a value of 8 by:

mynum <- 3 + 5

We can now use our new object mynum exactly the same way that we would use the number 8. For example, we can add 1 to it to get the number 9:

mynum + 1

## [1] 9

Object names should start with a letter; after that, they can also contain digits, underscores, and periods. This semester, we will use only lowercase letters and underscores for object names. That makes the names easier to read and easier to remember.
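
As a small illustration of this naming style, the sketch below creates a few objects using only lowercase letters and underscores. The object names are made up for this example; the two values are the calorie counts for Apple and Banana that appear in the food dataset later in these notes.

```r
# Hypothetical object names, chosen only to illustrate the naming style
apple_calories <- 52
banana_calories <- 89

# Objects can be combined just like the numbers they store
total_calories <- apple_calories + banana_calories
total_calories
```

Running the last line prints the stored total, 141.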

Running functions

A function in R is something that takes a number of input values and returns an output value. Generally, a function will look something like this:

function_name(arg1 = input1, arg2 = input2)

Here, arg1 and arg2 are the names of the inputs to the function (they are fixed), and input1 and input2 are the values that we assign to them. The number of arguments is not always two, however. There may be any number of arguments, including zero. There may also be optional arguments, which have default values that can be modified.

Let us look at an example function: seq. This function returns a sequence of numbers. We can give the function two input arguments: the starting point from and the ending point to.

seq(from = 1, to = 7)

## [1] 1 2 3 4 5 6 7

The function returns a sequence of numbers starting from 1 and ending at 7 in increments of 1. The return values are shown (in this document) right below the code block. Note that you can also pass arguments by position, in which case R matches them using the default ordering of the arguments. Here is the same code but without the names:

seq(1, 7)

## [1] 1 2 3 4 5 6 7

There is also an optional argument by that controls the spacing between each of the numbers. By default it is equal to 1, but we can change it to 0.5 to space the numbers half a unit apart.

seq(from = 1, to = 7, by = 0.5)

##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
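
Named and positional arguments can also be mixed in a single call. As a small sketch (the values and the object name odd_numbers are chosen only for illustration), the first two inputs below are matched to from and to by position, while by is given by name:

```r
# from and to are matched by position; by is matched by name
odd_numbers <- seq(1, 7, by = 2)
odd_numbers
```

This returns the numbers 1, 3, 5, and 7. Naming the optional arguments, as done here with by, tends to make a call easier to read.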

We will learn how to use numerous functions in the coming notes.

Loading data

In these notes we will be working with data that is stored in a tabular format. Here is an example of a tabular dataset of food types, which has nine rows and five columns. Each row tells us the nutritional properties contained in 100 grams of a particular type of food.

Every row of the dataset represents a particular object in our dataset, each of which we call an observation. In our food type example, each individual food corresponds to a specific observation:

The columns in a tabular dataset represent the measurements that we record for each observation. These measurements are called features. In our example dataset, we have five features which record the name of the food type, the food group that the food falls into, the number of calories in a 100g serving, the amount of sodium (mg) in a 100g serving, and the amount of vitamin A (as a percentage of daily recommended value) in a 100g serving.
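
To make the ideas of observations and features concrete, here is a minimal sketch that builds a tiny tabular dataset by hand and counts its rows and columns. It uses base R's data.frame function rather than reading a file, and the object name small_food is made up for this example; the values are copied from three rows of the food dataset printed later in these notes.

```r
# A tiny, hand-built table: each row is an observation (a food) and
# each column is a feature (a measurement recorded for that food)
small_food <- data.frame(
  item = c("Apple", "Avocado", "Beef"),
  food_group = c("fruit", "fruit", "meat"),
  calories = c(52, 160, 288)
)

nrow(small_food)   # number of observations
ncol(small_food)   # number of features
names(small_food)  # names of the features
```

Here there are three observations and three features, named item, food_group, and calories.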

A larger version of this dataset, with more food types and nutritional facts, is included in the course materials. We will make extensive use of this dataset in the following notes as a common example for creating visualizations, performing data manipulation, and building models. In order to read in the dataset we use a function called read_csv and pass it a description of where the file is located relative to where this script is stored. The file is called food.csv and is stored in the folder data. The following code will load the food dataset into R, save it as an object called food, and print out the first several rows:

library(readr)  # provides read_csv; loaded here in case it is not already attached

food <- read_csv(file = "../data/food.csv")
food

## # A tibble: 61 × 17
##    item     food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
##    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 Apple    fruit        52     0.1   0.028       0      1 13.8    2.4 10.4 
##  2 Asparag… vegeta…      20     0.1   0.046       0      2  3.88   2.1  1.88
##  3 Avocado  fruit       160    14.6   2.13        0      7  8.53   6.7  0.66
##  4 Banana   fruit        89     0.3   0.112       0      1 22.8    2.6 12.2 
##  5 Chickpea grains      180     2.9   0.309       0    243 30.0    8.6  5.29
##  6 String … vegeta…      31     0.1   0.026       0      6  7.13   3.4  1.4 
##  7 Beef     meat        288    19.5   7.73       87    384  0      0    0   
##  8 Bell Pe… vegeta…      26     0     0.059       0      2  6.03   2    4.2 
##  9 Crab     fish         87     1     0.222      78    293  0.04   0    0   
## 10 Broccoli vegeta…      34     0.3   0.039       0     33  6.64   2.6  1.7 
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>, and abbreviated variable names ¹food_group, ²calories,
## #   ³total_fat, ⁴cholesterol

Notice that the display shows that there are a total of 61 rows and 17 features. The first 10 rows and 10 columns are shown. At the bottom, the names of the additional features are given. As described above, if you run this in RStudio, you can view a full tabular version of the dataset by clicking on the dataset name in the Environment tab. The abbreviations <chr> and <dbl> tell us which features are characters (item, food_group, wiki, description, and color) and which are numbers (all the others).

Many of the examples in the following notes will make use of this food dataset to demonstrate new concepts. Another related dataset that will also be useful for illustrating several concepts contains the prices of various food items over more than 140 years. We can read it into R using a similar block of code, namely:

food_prices <- read_csv(file = "../data/food_prices.csv")
food_prices

## # A tibble: 146 × 14
##     year   tea sugar peanuts coffee cocoa wheat   rye  rice  corn barley
##    <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  1870  129.  151.    203.   88.1  78.8  88.1 103.   83.5 121.    103.
##  2  1871  132.  167.    222.  109.   66.7 118.  105.   84.5  88.4   130.
##  3  1872  134.  162.    189.  140.   71.6 122.  102.   92.9  69.2   125.
##  4  1873  136.  154.    179.  173.   65.8 116.  106.   91.0  67.1   166.
##  5  1874  146.  153.    231.  187.   69.9 113.  126.   99.6 128.    174.
##  6  1875  149.  150.    197.  176.   69.4 110.  116.   85.8 127.    161.
##  7  1876  150.  160.    172.  184.   80.7 114.  106.   95.3  91.2   132.
##  8  1877  149.  189.    153.  198.   87.8 144.   97.0 108.   94.5   125.
##  9  1878  150.  165.    160.  169.   96.0 115.   91.6 114.   82.2   121.
## 10  1879  144.  158.    133.  149.  108.  118.  113.  110.   78.7   124.
## # … with 136 more rows, and 3 more variables: pork <dbl>, beef <dbl>,
## #   lamb <dbl>

Here, each observation is a year. Features correspond to specific types of food. Notice that this is different from the food dataset, in which the food items were observations.

Formatting

It is very important to format your code in a consistent way. Even when code runs without errors and produces the desired results, good formatting makes it much easier to read and debug. We will use the following guidelines:

always put one space before and after an equals sign or assignment arrow
always put one space after a comma, but no space before a comma
always put one space around mathematical operations (such as + and *)

It will make your life a lot easier if you get used to these rules right from the start. We will practice and review this in class.
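
As an illustration of these guidelines, the sketch below computes the same value twice: once in a cramped style that R accepts but that is hard to read, and once following the rules above. The object names are hypothetical.

```r
# Hard to read: no spaces around the arrow, operators, or after commas
cramped<-seq(from=1,to=7,by=2)*2+1

# Follows the guidelines: spaces around <-, =, * and +, and after commas
well_spaced <- seq(from = 1, to = 7, by = 2) * 2 + 1

# Both lines produce exactly the same result
identical(cramped, well_spaced)
```

R treats the two lines identically, so the final check returns TRUE; the only difference is how easy the code is for a person to read.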

Homework Questions

At the end of each set of notes, such as this one, there will be a short set of questions or activities to complete before the next class. Bring written solutions with you to class.

Make sure you have R, RStudio, and all of the packages installed. If you are still having trouble with anything, please let me know during class.
On a piece of paper, make an example of a tabular dataset with five rows and three columns. This can capture any type of information you would like. We will share these together in class.
Give each of the columns of your dataset names. Try to follow the variable name rules described above.

Once you have finished reading and completing the items above, make sure to submit the pre-class form.