R Notebooks

This file is called an R Notebook. It is a mixture of text (like this) written in a format called markdown, and blocks of code that look like this:

2 + 2
## [1] 4

You can run a block of code by clicking on the green arrow in the top-right of the code block. Try this on the block above; you should see the result of the addition appear below the code.

Most of the notes this semester will be given in the form of a new notebook. During class we will work through the notes and embedded questions. Depending on the pace of the day, we may not finish all of the questions in the notebook. You should try to finish the remaining questions for homework before the next class meeting.

Today we will cover some of the basics of running R code and introduce the various parts of the RCloud software.

Running R Code

We now give a very brief overview of how to run R code. From here on we show only snippets of R code and their output rather than a screenshot of the entire RStudio session. You should still think of each snippet as occurring inside one of the grey boxes in an RMarkdown file.

In one of its most basic forms, R can be used as a fancy calculator. We already saw this above. Or, for example, we can divide 12 by 4:

12 / 4
## [1] 3

We can also store values by creating new objects within R. To do this, use the <- (arrow) symbol. For example, we can create a new object called mynum with a value of 8 by:

mynum <- 3 + 5

Notice that the number will also show up in the Environment pane in the upper right-hand corner of the RStudio window. We can now use our new object mynum exactly the same way that we would use the number 8. For example, adding 1 to it gives the number nine:

mynum + 1
## [1] 9
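Here is a short sketch of how assignment behaves, reusing the mynum object from above. The name doubled is made up for this illustration. Assigning with <- prints nothing, while typing an object's name on its own line prints its value.

```r
# Assignment stores the value but prints nothing.
mynum <- 3 + 5

# Typing the name on its own line prints the stored value (8).
mynum

# An object can be used to build other objects.
# "doubled" is a hypothetical name chosen for this example.
doubled <- mynum * 2
doubled
```

Running the block should print 8 and then 16, and both objects remain available to later code blocks in the same session.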

As we work through course notes this semester, you will find questions scattered throughout the notebook. Try to answer these as we go along; occasionally I may call on someone to show and/or tell their answers. Here’s our first question. In the code block below, divide the number 57 by 3:

57 / 3
## [1] 19

Some questions will also be in the form of a short text response, usually followed with the word Answer, where you should put your response, like below.

Is 57 a prime number? Answer: No, because it is evenly divisible by 3.
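As an aside, one way to check divisibility directly is with R's remainder operator %%, which is not otherwise used in these notes. This is only a sketch of the idea: a remainder of zero means the division is exact.

```r
# %% is R's remainder (modulo) operator.
# The remainder of 57 divided by 3 is 0, so 3 divides 57 evenly.
57 %% 3

# Comparing the remainder to 0 gives a TRUE/FALSE answer.
57 %% 3 == 0

# For contrast, 3 does not divide 58 evenly: the remainder is 1.
58 %% 3
```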

Running functions

A function in R is something that takes a number of input values and returns an output value. Generally, a function will look something like this (Note: This is just an example. If you run the code you will see an error because it is not a real function.):

function_name(arg1 = input1, arg2 = input2)

Where arg1 and arg2 are the names of the inputs to the function (they are fixed) and input1 and input2 are the values that we will assign to them. The number of arguments is not always two, however. There may be any number of arguments, including zero. Also, there may be additional optional arguments that have default values that can be modified.
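As a quick sketch of these cases, here are two functions built into R. Sys.Date takes no arguments at all, and round has an optional argument named digits with a default value of 0. The object names today, rounded_default, and rounded_two are made up for this illustration.

```r
# Zero arguments: Sys.Date() needs no input and returns today's date.
today <- Sys.Date()
today

# One required input plus an optional one: round() has a "digits"
# argument that defaults to 0 when we do not supply it.
rounded_default <- round(3.14159)
rounded_default

# Supplying digits = 2 overrides the default.
rounded_two <- round(3.14159, digits = 2)
rounded_two
```

The first call to round should give 3 and the second 3.14; the date printed will depend on when you run the code.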

Let us look at an example function: seq. This function returns a sequence of numbers. We can give the function two input arguments: the starting point from and the ending point to.

seq(from = 1, to = 7)
## [1] 1 2 3 4 5 6 7

The function returns a sequence of numbers starting from 1 and ending at 7 in increments of 1. Note that you can also pass arguments by position, in which case we use the default ordering of the arguments. Here is the same code but without the names:

seq(1, 7)
## [1] 1 2 3 4 5 6 7

There is also an optional argument by that controls the spacing between each of the numbers. By default it is equal to 1, but we can change it, for example to space the numbers half a unit apart.

seq(from = 1, to = 7, by = 0.5)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
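A few more variations on seq, offered as a sketch rather than a complete tour. The first mixes positional and named arguments, the second counts down with a negative step, and the third uses length.out, another optional argument of seq that is not covered in these notes. The object names are made up for illustration.

```r
# Mixing styles: the first two arguments by position, "by" by name.
odds <- seq(1, 7, by = 2)
odds

# "by" can be negative to count down instead of up.
countdown <- seq(from = 7, to = 1, by = -2)
countdown

# "length.out" asks for a fixed number of evenly spaced values
# instead of a fixed step size.
four_points <- seq(from = 1, to = 7, length.out = 4)
four_points
```

The first and third calls should both give 1 3 5 7, and the second should give 7 5 3 1.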

We will learn how to use numerous functions throughout the semester.

Loading Data

The goal of this course is to learn how to work with data. Not surprisingly, we will learn a number of functions for loading data into R. In the next class we will introduce more of the details around organizing and loading data. Let’s just see a quick example for now. We will load a data set called food.csv that is stored in the folder data. To start, we need to load the appropriate R library. (Note: Usually we will do this at the start of the notebook.)

library(tidyverse)

Next, the following code uses the function read_csv to load this data set into R, saves it as an object called food, and prints out the first several rows.

food <- read_csv("data/food.csv")
food
## # A tibble: 61 x 17
##    item  food_group calories total_fat sat_fat cholesterol sodium carbs fiber
##    <chr> <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl> <dbl>
##  1 Apple fruit            52       0.1   0.028           0      1 13.8    2.4
##  2 Aspa… vegetable        20       0.1   0.046           0      2  3.88   2.1
##  3 Avoc… fruit           160      14.6   2.13            0      7  8.53   6.7
##  4 Bana… fruit            89       0.3   0.112           0      1 22.8    2.6
##  5 Chic… grains          180       2.9   0.309           0    243 30.0    8.6
##  6 Stri… vegetable        31       0.1   0.026           0      6  7.13   3.4
##  7 Beef  meat            288      19.5   7.73           87    384  0      0  
##  8 Bell… vegetable        26       0     0.059           0      2  6.03   2  
##  9 Crab  fish             87       1     0.222          78    293  0.04   0  
## 10 Broc… vegetable        34       0.3   0.039           0     33  6.64   2.6
## # … with 51 more rows, and 8 more variables: sugar <dbl>, protein <dbl>,
## #   iron <dbl>, vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>,
## #   description <chr>, color <chr>

You should also notice that the food data set appears in the Environment pane in the upper right-hand corner of the screen. Clicking on it will open a spreadsheet-like view of the data set. We will be working with these food data as a source of examples throughout the first few weeks of the class.
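To see the load-and-inspect pattern without needing the course files, here is a minimal self-contained sketch. It writes a three-row CSV file to a temporary location and reads it back. Two things here are assumptions rather than part of the notes: it uses read.csv, the base R counterpart of read_csv, so that it runs without loading any library, and the names tmp and small_food are made up for the example. The rows copy a few values from the printout above.

```r
# Write a tiny CSV file to a temporary location so the example is
# self-contained. "tmp" is a hypothetical name for the file path.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "item,food_group,calories",
    "Apple,fruit,52",
    "Beef,meat,288",
    "Crab,fish,87"
  ),
  tmp
)

# read.csv() is the base R counterpart of read_csv().
small_food <- read.csv(tmp)
small_food

# Basic facts about the table: number of rows, number of columns,
# and the column (variable) names.
nrow(small_food)
ncol(small_food)
names(small_food)
```

The same three functions, nrow, ncol, and names, also work on data sets loaded with read_csv.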

Formatting

It is very important to properly format your code in a consistent way. Even though the code may run without errors and produce the desired results, you will make your life easier by writing well-formatted code from the start. This makes it easier to read and debug in the future. We will use the following guidelines:

In the code block below, I wrote some code that selects all of the vegetables in the data set and produces a scatter plot of them. We will learn more about these functions over the next few weeks. For now, just reformat the code.

veggie <- filter(food, food_group == "vegetable")
ggplot(veggie, aes(x = calories, y = total_fat)) + geom_point()

Try running the code to see how easy RStudio makes it to embed visualizations into RMarkdown notebooks.
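As a small illustration of what formatting changes and what it does not, the sketch below writes the same call twice. It assumes the guidelines include the common R conventions of one space around the assignment arrow and equals signs and one space after each comma; the names cramped and spaced are made up for the example.

```r
# Hard to read: no spaces around the arrow, equals signs, or commas.
cramped<-seq(from=1,to=7,by=0.5)

# Easier to read: the same call with consistent spacing.
spaced <- seq(from = 1, to = 7, by = 0.5)

# R treats the two versions identically; only readability differs.
identical(cramped, spaced)
```

Because the two objects are identical, formatting is purely for the benefit of the people reading the code.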

Practice: Running R Code

That’s all we have today for new material. In the rest of the notebook are some further practice questions to see how well you understood the material.

In the code block below, make a variable named fav_number and set it equal to your favorite number. (Note, the quotes are not part of the variable name).

fav_number <- 8

In the code below, apply the function log10 to your favorite number. (Again, the quotes are not part of the function name).

log10(fav_number)
## [1] 0.90309

Note that the only point of the exercises here is to familiarize you with running code in R, creating variables, and applying functions. We will get to more interesting tasks soon!

Practice: Running Functions

R contains several functions for producing (pseudo)random numbers. These are useful, for example, when using R to run simulation models. The function runif selects (by default) a set of random numbers between 0 and 1. It takes one required argument named n, which indicates how many random numbers should be generated. In the code below, use the runif function to produce 100 random numbers. Verify that each time you run the code a different set of numbers is produced:

runif(n = 100)
##   [1] 0.498986716 0.560345379 0.251729940 0.483857148 0.565100029 0.525645933
##   [7] 0.578793901 0.728793423 0.903679360 0.505007163 0.375971330 0.372624371
##  [13] 0.905182353 0.555913002 0.506825859 0.935318328 0.088157046 0.417522871
##  [19] 0.597564549 0.466816837 0.500847098 0.124167966 0.497301519 0.929420822
##  [25] 0.012805841 0.492082793 0.572513413 0.913207399 0.561870852 0.153400058
##  [31] 0.359298880 0.802876559 0.379418243 0.047873885 0.262220815 0.240997005
##  [37] 0.459875060 0.564053860 0.260181518 0.003242990 0.443559594 0.336745947
##  [43] 0.486823811 0.841758571 0.517522222 0.668155450 0.612714650 0.170513926
##  [49] 0.521151946 0.691637477 0.800804621 0.501566800 0.656487092 0.220993425
##  [55] 0.431167705 0.854907054 0.594319841 0.009093626 0.767200788 0.894329919
##  [61] 0.097291972 0.294737179 0.983614419 0.346500337 0.515778457 0.851859728
##  [67] 0.781583854 0.432914234 0.330367066 0.091161182 0.291833219 0.320596619
##  [73] 0.824664715 0.904660614 0.252035325 0.926419848 0.538525501 0.317082525
##  [79] 0.295561965 0.333140126 0.744433366 0.288501257 0.352488936 0.633187238
##  [85] 0.591954740 0.289816062 0.778822205 0.068171908 0.537202618 0.296197950
##  [91] 0.525957356 0.023194122 0.853486036 0.397189126 0.646116794 0.812079563
##  [97] 0.092807167 0.701531979 0.834245612 0.369246149
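One way to do this check in code, rather than by eye, is to compare two sets of draws with identical. The sketch below also shows set.seed, a base R function not otherwise covered in these notes, which makes the draws repeat when that is what you want. The seed value 42 and the object names are arbitrary choices for the example.

```r
# Two separate calls give different numbers.
first_draw <- runif(n = 5)
second_draw <- runif(n = 5)
identical(first_draw, second_draw)

# Resetting the seed to the same value before each call makes the
# draws repeat. The seed value 42 is arbitrary.
set.seed(42)
seeded_a <- runif(n = 5)
set.seed(42)
seeded_b <- runif(n = 5)
identical(seeded_a, seeded_b)
```

The first comparison should come out FALSE and the second TRUE.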

The runif function also has two optional parameters. These are named min and max; they determine the lower and upper bounds from which random numbers should be generated. By default these are set to 0 and 1. In the code below, generate 100 random numbers between 50 and 100. Here, pass the number of random numbers by position (that is, without writing n = 100).

runif(100, min = 50, max = 100)
##   [1] 67.19351 69.98160 56.22340 70.15617 73.80923 56.06514 51.53892 92.45850
##   [9] 50.16339 82.50818 92.51223 79.42758 80.51720 70.99775 72.58125 81.61902
##  [17] 78.57204 81.38750 89.49493 79.66719 69.96340 93.96411 59.26472 95.10669
##  [25] 50.11733 81.43524 83.69779 83.98599 56.54472 55.26004 79.04752 91.73477
##  [33] 59.01367 71.16307 66.79792 61.99470 81.35569 58.43207 69.00559 76.36547
##  [41] 85.87985 96.81033 72.31244 79.17749 75.04810 72.88894 92.84837 77.99496
##  [49] 50.58148 96.03606 97.56923 65.68149 80.75881 55.55574 57.54016 57.08628
##  [57] 72.17505 66.22104 56.95783 63.04934 53.81102 63.67812 74.53964 56.02761
##  [65] 81.13377 91.59396 74.95926 70.54071 94.83141 50.49554 65.23096 51.35603
##  [73] 57.52146 97.00550 96.10790 77.94434 63.96641 65.85839 70.95330 85.88886
##  [81] 91.49825 75.26337 70.78674 78.14380 74.67742 58.34167 63.01869 82.53334
##  [89] 55.72246 79.88705 89.35273 79.57801 91.27268 80.73424 63.52411 53.99482
##  [97] 69.13616 94.81249 92.78636 68.27807
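Rather than scanning one hundred values by eye, a short sketch like the following can confirm that the draws respect the bounds. The object name draws is made up for the example, and the exact values from range will differ on every run.

```r
# Store the random numbers so we can inspect them.
draws <- runif(100, min = 50, max = 100)

# How many values were generated?
length(draws)

# Smallest and largest value drawn; both should fall between 50 and 100.
range(draws)

# TRUE only if every value lies inside the requested bounds.
all(draws >= 50 & draws <= 100)
```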

A common concern for new R users when running the code above is the meaning behind the numbers in square brackets to the left of the output. These are not part of the output itself. Instead, they provide a counter telling you which result number appears just to the right of it. The first line will always start with [1] because it always starts with the first result. The second line’s number will depend on the width of your screen when you ran the code.
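A sketch with known values can make the counters easier to read. The vector below holds the numbers 101 through 130, so the value at position k is always 100 + k. Pulling out a single value with square brackets after the object's name is a technique these notes have not introduced yet; it is used here only to confirm what the counters mean, and the name x is arbitrary.

```r
# A vector of 30 known values: 101, 102, ..., 130.
x <- 101:130

# Printing x shows a counter at the start of each line giving the
# position of the first value on that line.
x

# Square brackets after the name pull out the value at one position.
x[1]
x[13]
```

The last two lines should print 101 and 113, matching the positions the counters would label.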

Practice: Largest Cities data set

We will make extensive use of a data set that has information about the largest cities (by population) in the world. To read in and print out this data set, run the following lines of code:

cities <- read_csv(file.path("data", "largest_cities.csv"))
cities
## # A tibble: 81 x 26
##    name  country city_definition population city_pop city_area metro_pop
##    <chr> <chr>   <chr>                <dbl>    <dbl>     <dbl>     <dbl>
##  1 Tokyo Japan   Metropolis pre…       37.4    13.5       2191      37.3
##  2 Delhi India   National capit…       28.5    16.8       1484      29  
##  3 Shan… China   Municipality          25.6    24.2       6341      NA  
##  4 São … Brazil  Municipality          21.6    12.3       1521      21.7
##  5 Mexi… Mexico  City-state            21.6     8.92      1485      20.9
##  6 Cairo Egypt   Urban governor…       20.1     9.5       3085      NA  
##  7 Mumb… India   Municipality          20.0    12.5        603      24.4
##  8 Beij… China   Municipality          19.6    21.7      16411      NA  
##  9 Dhaka Bangla… Capital city          19.6    14.4        338      14.5
## 10 Osaka Japan   Designated city       19.3     2.72       225      19.3
## # … with 71 more rows, and 19 more variables: metro_area <dbl>,
## #   urban_pop <dbl>, urban_area <dbl>, wiki <chr>, country_code2 <chr>,
## #   country_code3 <chr>, country_name_official <chr>, continent <chr>,
## #   lon <dbl>, lat <dbl>, koppen_code <chr>, koppen_main <chr>, city <chr>,
## #   num <dbl>, cost_of_living <dbl>, cost_rent <dbl>, cost_groceries <dbl>,
## #   cost_restaurant <dbl>, local_pp <dbl>

Looking at the data, try to answer the following questions.

How many rows are in the data set? Answer: 81 cities.

What are the observations in the data set? Answer: Each observation is a city.

List three of the variables in the data set. Answer: name, country, and population are three examples.

The population variables are given in millions of people. How many people live in all of Tokyo? Answer: 37.4 million

How many people live in Atlanta (Note: you will have to open the data explorer to see this data)? Answer: 5.57 million

Practice: Formatting

In the code block below, I wrote some code to add a new column to the data set that describes the population density for each city (1000s of people per square kilometer) and sorts from the most dense to the least dense. We will learn this code over the next 6 chapters. For now, I want to focus on formatting the code correctly. I did not include any spaces! Put the correct spaces into the code to make it match the style guide given in the notes.

new_data <- mutate(cities, city_density = city_pop / city_area * 1000)
new_data <- arrange(new_data, desc(city_density))
select(new_data, name, country, city_density)
## # A tibble: 81 x 3
##    name         country     city_density
##    <chr>        <chr>              <dbl>
##  1 Shenzhen     China               61.1
##  2 Dhaka        Bangladesh          42.6
##  3 Metro_Manila Philippines         41.4
##  4 Karachi      Pakistan            39.4
##  5 Kolkata      India               21.9
##  6 Mumbai       India               20.7
##  7 Paris        France              20.5
##  8 Luanda       Angola              18.7
##  9 Seoul        South Korea         16.2
## 10 Barcelona    Spain               16.0
## # … with 71 more rows

Make sure to run the code after you are done. What city in the data set has, on average, the most people per square kilometer? Answer: Shenzhen, China.
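To make the arithmetic behind city_density concrete, here is a minimal sketch on three cities, using the rounded city_pop and city_area values shown in the earlier printout of the data. It uses only base R as a stand-in for mutate, arrange, and select so that it runs without loading any library; the name small_cities is made up for the example, and because the inputs are rounded the results may differ slightly in later decimal places from those in the full data set.

```r
# Rounded values copied from the printout of the cities data:
# populations in millions, areas in square kilometers.
small_cities <- data.frame(
  name = c("Tokyo", "Dhaka", "Mumbai"),
  city_pop = c(13.5, 14.4, 12.5),
  city_area = c(2191, 338, 603),
  stringsAsFactors = FALSE
)

# Millions of people per square kilometer, times 1000, gives
# thousands of people per square kilometer.
small_cities$city_density <-
  small_cities$city_pop / small_cities$city_area * 1000

# Sort from most dense to least dense (base R stand-in for arrange/desc).
row_order <- order(small_cities$city_density, decreasing = TRUE)
small_cities <- small_cities[row_order, ]

# Keep only the columns of interest (base R stand-in for select).
small_cities[, c("name", "city_density")]
```

Dhaka should come out on top at roughly 42.6 thousand people per square kilometer, followed by Mumbai at roughly 20.7, in line with the table above.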

What next?

Hopefully we were able to finish these notes in class together. If not, try to finish them on your own. There is no need to hand these in. On the course website I will post solutions to each of the notebooks. Usually this will follow the course meeting, but in the case of this first week I posted them ahead of time. If you still have questions, bring them to the course office hours or to our next class meeting!