All of the code that we will work through this semester will be
stored as RMarkdown files, which have a .Rmd
extension.
These files are great because they allow us to mix code and descriptions
within the same file. Reading these notes will give you a brief overview
of how this works; we will practice hands-on in class.
When opening an RMarkdown file in RStudio, you should see a window similar to this (it will be slightly different on Windows and depending on your screen size):
On the left is the actual file itself. Some output and other helpful bits of information are shown on the right. There is also a Console window, which we generally will not need. I have minimized it in the graphic.
Notice that the file has parts that are on a white background and other parts that are on a grey background. The white parts correspond to text and the grey parts to code. In order to run the code, and to see the output, click on the green play button on the right of each block.
When you run code to read or create a new dataset, the data will be listed in the Environment tab in the upper right hand side of RStudio:
Clicking on the data will open a spreadsheet version of the data that you can view to understand the structure of your data and to see all of the columns that are available for analysis:
Going back to the RMarkdown file by clicking on the tab on the upper row, we can see how graphics work in R. We have written some code to produce a scatter plot. When the code is run, the plot displays inside of the markdown file:
Make sure to save the notebook frequently. However, notice that only the text and code themselves are saved. The results (plots, tables, and other output) are not automatically stored. This is actually helpful because the code is much smaller than the results, which keeps the file sizes small. If you would like to save the results in a way that can be shared with others, you need to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of the notebook. After running all the code from scratch, it will produce an HTML version of the script that you can open in a web browser:
In fact, the notes that you are currently reading were created with RMarkdown files that are knitted to HTML.
We now want to give a very brief overview of how to run R code. From here on, we will show only snippets of R code and their output rather than a screenshot of the entire RStudio session. You should think of each snippet as occurring inside one of the grey boxes in an RMarkdown file.
In one of its most basic forms, R can be used as a fancy calculator. For example, we can divide 12 by 4:
12 / 4
## [1] 3
We can also store values by creating new objects within R. To do this, use the <- (arrow) symbol. For example, we can create a new object called mynum with a value of 8 by:
mynum <- 3 + 5
We can now use our new object mynum exactly the same way that we would use the number 8. For example, adding it to 1 to get the number nine:
mynum + 1
## [1] 9
Object names should start with a letter and can also contain digits, underscores, and periods. This semester, we will use only lowercase letters and underscores for object names. That makes the code easier to read and makes it easier to remember what you have called things.
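As a small sketch of these naming conventions, the block below stores a few values under lowercase, underscore-separated names and then combines them. The object names are made up for illustration and do not appear elsewhere in the notes.

```r
# Illustrative object names: lowercase letters and underscores only.
first_num <- 3 + 5      # stores 8
second_num <- 12 / 4    # stores 3

# Objects can be combined just like the numbers they hold.
num_total <- first_num + second_num
num_total
## [1] 11

# Assigning to an existing name replaces the old value.
first_num <- 10
first_num + second_num
## [1] 13
```

Note that num_total still holds 11 after first_num is changed: an object keeps the value it was given until it is assigned again.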
A function in R is something that takes a number of input values and returns an output value. Generally, a function will look something like this:
function_name(arg1 = input1, arg2 = input2)
Here, arg1 and arg2 are the names of the inputs to the function (they are fixed) and input1 and input2 are the values that we will assign to them. The number of arguments is not always two, however. There may be any number of arguments, including zero. Also, there may be additional optional arguments that have default values that can be modified.
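To make the idea of zero, required, and optional arguments concrete, here is a short sketch using two functions built into R that are not otherwise part of these notes: getwd, which takes no arguments, and round, whose digits argument defaults to 0. The object names are hypothetical.

```r
# A function with zero arguments: returns the current working directory.
current_dir <- getwd()

# round() has one required input and an optional one (digits),
# which defaults to 0 when it is left out.
rounded_default <- round(3.14159)
rounded_default
## [1] 3

# Supplying the optional argument overrides its default value.
rounded_two <- round(3.14159, digits = 2)
rounded_two
## [1] 3.14
```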
Let us look at an example function: seq. This function returns a sequence of numbers. We can give the function two input arguments: the starting point from and the ending point to.
seq(from = 1, to = 7)
## [1] 1 2 3 4 5 6 7
The function returns a sequence of numbers starting from 1 and ending at 7 in increments of 1. The return values are shown (in this document) right below the code block. Note that you can also pass arguments by position, in which case we use the default ordering of the arguments. Here is the same code but without the names:
seq(1, 7)
## [1] 1 2 3 4 5 6 7
There is also an optional argument by that controls the spacing between each of the numbers. By default it is equal to 1, but we can change it to space the points half a unit apart:
seq(from = 1, to = 7, by = 0.5)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
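Two further variations on seq may help, sketched here with values chosen only for illustration: named arguments can be supplied in any order, and a negative by counts downward. The object name countdown is hypothetical.

```r
# Named arguments can appear in any order; this matches seq(from = 1, to = 7).
seq(to = 7, from = 1)
## [1] 1 2 3 4 5 6 7

# A negative value of `by` produces a decreasing sequence.
countdown <- seq(from = 10, to = 0, by = -2.5)
countdown
## [1] 10.0  7.5  5.0  2.5  0.0
```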
We will learn how to use numerous functions in the coming notes.
In these notes we will be working with data that is stored in a tabular format. Here is an example of a tabular dataset of food types, which has nine rows and five columns. Each row tells us the nutritional properties contained in 100 grams of a particular type of food.
Every row of the dataset represents a particular object in our dataset, each of which we call an observation. In our food type example, each individual food corresponds to a specific observation:
The columns in a tabular dataset represent the measurements that we record for each observation. These measurements are called features. In our example dataset, we have five features which record the name of the food type, the food group that the food falls into, the number of calories in a 100g serving, the amount of sodium (mg) in a 100g serving, and the amount of vitamin A (as a percentage of daily recommended value) in a 100g serving.
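To connect this vocabulary to code, the sketch below builds a three-row table by hand and counts its observations and features. It uses base R's data.frame rather than the course's own tools, the object name small_food is hypothetical, and the values are copied from rows of the foods printout that appears later in these notes.

```r
# A hand-built table; each vector becomes one feature (column).
small_food <- data.frame(
  item = c("Apple", "Avocado", "Banana"),
  food_group = c("fruit", "fruit", "fruit"),
  calories = c(52, 160, 89),
  sodium = c(1, 7, 1)
)

# Number of observations (rows).
nrow(small_food)
## [1] 3

# Number of features (columns).
ncol(small_food)
## [1] 4

# The names of the features.
names(small_food)
```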
A larger version of this dataset, with more food types and
nutritional facts, is included in the course materials. We will make
extensive use of this dataset in the following notes as a common example
for creating visualizations, performing data manipulation, and building
models. In order to read in the dataset we use a function called read_csv and pass it a description of where the file is located relative to where this script is stored. The data is called food.csv and is stored in the folder data.
The following code will load the foods dataset into R, save it as an object called food, and print out the first several rows:
food <- read_csv(file = "../data/food.csv")
food
## # A tibble: 61 × 17
## item food_…¹ calor…² total…³ sat_fat chole…⁴ sodium carbs fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apple fruit 52 0.1 0.028 0 1 13.8 2.4 10.4
## 2 Asparag… vegeta… 20 0.1 0.046 0 2 3.88 2.1 1.88
## 3 Avocado fruit 160 14.6 2.13 0 7 8.53 6.7 0.66
## 4 Banana fruit 89 0.3 0.112 0 1 22.8 2.6 12.2
## 5 Chickpea grains 180 2.9 0.309 0 243 30.0 8.6 5.29
## 6 String … vegeta… 31 0.1 0.026 0 6 7.13 3.4 1.4
## 7 Beef meat 288 19.5 7.73 87 384 0 0 0
## 8 Bell Pe… vegeta… 26 0 0.059 0 2 6.03 2 4.2
## 9 Crab fish 87 1 0.222 78 293 0.04 0 0
## 10 Broccoli vegeta… 34 0.3 0.039 0 33 6.64 2.6 1.7
## # … with 51 more rows, 7 more variables: protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>, and abbreviated variable names ¹food_group, ²calories,
## # ³total_fat, ⁴cholesterol
Notice that the display shows that there are a total of 61 rows and 17 features. The first 10 rows and 10 columns are shown. At the bottom, the names of the additional features are given. As described above, if you run this in RStudio, you can view a full tabular version of the dataset by clicking on the dataset name in the Environment tab. The abbreviations <chr> and <dbl> tell us which features are characters (item, food_group, wiki, description, and color) and which are numbers (all the others).
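The same checks can be tried without the course files. The sketch below is a stand-in under stated assumptions: it writes a tiny CSV to a temporary file, reads it back with base R's read.csv instead of read_csv, and asks for the dimensions and column types. The names tmp_path, tiny_food, and col_types are hypothetical; the three rows are copied from the printout above.

```r
# Write a three-row CSV to a temporary location (illustrative file only).
tmp_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "item,food_group,calories,total_fat",
    "Apple,fruit,52,0.1",
    "Avocado,fruit,160,14.6",
    "Banana,fruit,89,0.3"
  ),
  con = tmp_path
)

# Read it back; base R's read.csv stands in for read_csv here.
tiny_food <- read.csv(file = tmp_path, stringsAsFactors = FALSE)

# Rows (observations) and columns (features).
dim(tiny_food)
## [1] 3 4

# Storage type of each column: text versus numbers.
col_types <- sapply(tiny_food, class)
col_types
```

Base R reports a whole-number column such as calories as integer, whereas the tibble printout above labels every numeric column <dbl>.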
Many of the examples in the following notes will make use of this foods dataset to demonstrate new concepts. Another related dataset that will also be useful for illustrating several concepts contains the prices of various food items for over 140 years. We can read it into R using a similar block of code, namely:
food_prices <- read_csv(file = "../data/food_prices.csv")
food_prices
## # A tibble: 146 × 14
## year tea sugar peanuts coffee cocoa wheat rye rice corn barley
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1870 129. 151. 203. 88.1 78.8 88.1 103. 83.5 121. 103.
## 2 1871 132. 167. 222. 109. 66.7 118. 105. 84.5 88.4 130.
## 3 1872 134. 162. 189. 140. 71.6 122. 102. 92.9 69.2 125.
## 4 1873 136. 154. 179. 173. 65.8 116. 106. 91.0 67.1 166.
## 5 1874 146. 153. 231. 187. 69.9 113. 126. 99.6 128. 174.
## 6 1875 149. 150. 197. 176. 69.4 110. 116. 85.8 127. 161.
## 7 1876 150. 160. 172. 184. 80.7 114. 106. 95.3 91.2 132.
## 8 1877 149. 189. 153. 198. 87.8 144. 97.0 108. 94.5 125.
## 9 1878 150. 165. 160. 169. 96.0 115. 91.6 114. 82.2 121.
## 10 1879 144. 158. 133. 149. 108. 118. 113. 110. 78.7 124.
## # … with 136 more rows, and 3 more variables: pork <dbl>, beef <dbl>,
## # lamb <dbl>
Here, each observation is a year. Features correspond to specific
types of food. Notice that this is different than the foods
dataset, in which the food items were observations.
It is very important to format your code in a consistent way. Even when code runs without errors and produces the desired results, good formatting makes it easier to read and debug. We will follow these guidelines:
+ and *)
It will make your life a lot easier if you get used to these rules right from the start. We will practice and review this in class.
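As a sketch of what consistent formatting looks like in practice, the block below assumes the widely used convention of one space on each side of operators such as <-, + and *, and one space after (but not before) a comma. The exact course guidelines may differ, and the object names cramped and spaced are hypothetical.

```r
# Hard to read: no spacing around operators or after commas.
cramped<-seq(from=1,to=7,by=2)*3+1

# Same computation, formatted consistently.
spaced <- seq(from = 1, to = 7, by = 2) * 3 + 1

# Formatting does not change the result.
identical(cramped, spaced)
## [1] TRUE
```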
At the end of each set of notes, such as this one, there will be a short set of questions or activities to complete before the next class. Bring written solutions with you to class.
Once you have finished reading and completing the items above, make sure to submit the pre-class form.