Class 02: Sleepy Mammals and Sea Snails
Now, I want to throw you all into the format for the class and get you started with how assignments will work. This document, as well as all of your assignments, is written in a format called Rmarkdown. At the moment we are looking at the source code; if you are familiar with markdown, it is just a specific flavor of the basic standard.
Plain markdown is just a very simple format for marking up
texts. Unlike writing text in a word editor, there is nothing
hidden in a markdown file. If you want something to be in
italics, surround it in single stars like this. For bold,
use two. Links use the following format:
site name. If you want to make
something look like code in the output, you can enclose
text in back ticks:
We can turn markdown into a number of other formats. If I hit the preview button above, it will display what the HTML version of the document looks like. I’ll usually export this HTML version and put it on the class website as it is slightly cleaner than the raw format here.
The most powerful feature of Rmarkdown is that we can intermix code into the document and actually run it in real time. To do this, we enclose code blocks with three back ticks, and preface the first one with r in squiggly brackets. Every code block will get run in sequence if when we hit the Preview or Knit buttons, but we can also run just this block by hitting the play button to the right of the block. Here, let’s add 1 and 1 together:
The output is written directly to the document. This is the reason I’ll often use the raw format in class; it allows us to modify the code and see the changes as they run.
Note that there are shortcuts for running a chunk of code. To see them for your platform, look at the menu under the Run button.
In addition to the basic R functions that exist on start-up,
there are thousands of user-contributed packages the implement
various add-ons. To install these packages, we use the function
Once a package is installed, we also need to load it. While
installing the package only needs to be done once, we have to
load it each and every time we restart R (notice up above that
I included the option
eval = FALSE in the code block so that
my computer does not constantly reinstall it):
Once loaded, we can run commands from the readr package. We will do this in just a moment.
Let’s load in four libraries that will be important for us throughout the semester. Note that I’ll turn off messages as the packages produce quite a bit of verbose output that we do not need to worry about. Also note that re-loading the readr package has no ill-effects.
Next, let’s load in a small dataset to work with today. This data consists of the average number of hours per day that various species are awake. We will read the data set, like most of those in this class, directly from my website:
Once this code has been run, you should see the dataset pop up in the upper right hand corner of the screen. Clicking on it opens up a spreedsheet-like view of the data.
Particularly important are the first three columns, as most datasets that we work with this semester will have these columns as well. The meaning
- the first column is just an id; don’t lose it!
- the second indicates whether this is a sample where you know the response or if have to predict the response
- the third column gives the response of interest; it is missing whenever the second column is equal to “test”
We will use the function
qplot to produce simple plots
based on our data. In its most basic form, we give
the name of the variable we want to plot and the name of
the dataset in which the variable resides.
The function is quite smart and will choose by default the most appropriate plot to use. Here it is creating a histogram because we gave it one numeric variable.
qplot to a categorical variable gives a bar plot:
We can also give
qplot two variable. For instance, if we
use two continuous variables the function will construct a
The missing values that we are warned about are those points where we need to produce predictions.
There is not a special name for the plot with a continuous variable plotted versus a categorical one, but the point should be clear (not too different from a scatter plot):
Finally, one can plot two categorical variables. It is often not useful, but here it does help to understand the relationship between order and diet:
qplot function comes from the ggplot2 package. In
ggplot2, plots can be modified by changing what are known
as aesthetic mappings. Common aesthetics are: color, size, and
alpha (the opacity of the points). To change any of these to
a fixed value for all points, simply use the
in the call to
Alternatively, we can map aesthetics to variables in our dataset.
For example, let’s change the color to a function of the diet
type of each mammal (note that here we do not use the
Color behaves differently if we use a discrete variable or a numeric one. Notice here how a color gradient is used in place of a fixed scale:
This last plot is particularly hard to read on the projector. It is also difficult on my screen and for anyone with any degree of color-blindness. The package viridis can help. Let’s load it in (if you want to run this line, make sure to install the package first!):
We now literally add a layer to the plot describing what color scale we would like to make use of:
We will make use of these aesthetics throughout the semester to explore datasets. In particular, the color mapping allows us to look at more than two variables at the same time.
It seems like there is a positive relationship between body
weight and hours spent awake. We can add a best-fit line to
the plot to model this relationship. Adding things to a
qplot graphic literally involves using the
+ sign and
adding the results of other functions.
Here we use
with an option to make the smooth strictly linear:
We can fit the exact some model analytically, rather than graphically,
by calling the
lm function directly. Here we use an R formula: the
response variable (thing we want to predict), followed by the
sign, followed by the predictor variable. As with the graphics command,
we need to indicate which dataset is being used. The output shows
the slope and intercept of the best-fit-line.
predict will give us the predicted values implied
by this line:
Accessing and adding new variables
Columns of the dataset
msleep can be referenced by
using the dataset name followed by a dollar sign and
the name of the variable. So, for example, here are the
hours spent awake:
NA’s correspond to the missing values we want to predict.
Similarly, we can add a new variable by simply referencing it
$ and assigning it to something. Here, we add those
predictions back into the
Click on the msleep dataset again to verify that there is a new column of predictions. Notice that these are filled in for all values.
Now, we use the
select function to pick just a few
msleep and save this as a new dataset.
The select function takes the name of the dataset first,
followed by whatever variable names we want to keep. Here
we include just the observation id and our prediction:
Look at the
submit dataset and verify that it is what
you would expect.
write_csv function will save a dataset
to a file. Each of the columns will be separated by a
comma (that’s what the ‘c’ stands for in csv) and the
rows will be separated by a new line:
This is the kind of file I expect you to produce for each class assignment. Notice that we’ve created predictions for all of the data, even though we already know some of the values. It is often easier to do this and avoids the mistake of predicting on the wrong set. Of course, I’ll only grade you on the set that you were not able to see.
Now you should be familiar with the basics of using R to read data, create visualizations, and produce predictions. Let’s go through and show how to put these together to produce and submit your daily labs. I will do this interactively, but here are the steps in case you are returning to these notes at a later time:
- I will send a link to your e-mail that you should follow and accept. Assuming you have a valid GitHub account, this will set up a repository where all of your labs for this semester should be posted.
- Select the lab you would like to work on (
lab01.Rmdhere) and click on the link.
- Now, click on the Raw button. This should take you to the raw notebook file. Download this through your browser somewhere on your computer.
- Open this file in RStudio and work through the questions.
- When you are finished, select the
Knit to HTMLbutton. This should create an html file in the same location that you saved the Rmd file. You should also have a file
class01_submit.csvin the same location.
- Return to GitHub and drag and drop the three files into your repository. Commit these to the repository with the Commit changes button and then you are done! If dragging and dropping does not work (only officially support on Chrome I believe), select Upload files and do so manually.
- There should already be a
lab02.Rmdfor you to work on for the next class. New labs will show up automatically as we proceed through the semester.
Remember that labs are due online 90 minutes before class starts (noon on Tuesdays and Thursdays).