R is an open-source programming language. Under the hood, it is similar to other scripting languages such as Python, but it was originally designed specifically with the goal of data analysis in mind. Because of this, R has become the preferred language for statisticians. The document here is written in a specific type of R script called an R Markdown document. These documents have a mixture of plain text (what you are reading here), R code, and output. You can read the raw form of the file in any text editor, or, for better functionality, you can load it into the open-source RStudio IDE. The markdown file can be converted into a number of output formats for easy reading. If you found this file linked to from the course website and are reading it in your browser, you’re probably reading the HTML output of the file. The source code can also be downloaded from the site.
In courses such as DSST289 and DSST389, we go through a lot of detail about how to use R and various third-party packages to do increasingly complex analyses with data. In DSST330 we also go into more detail about how to run various models on data, though we will mostly work with data that is relatively well-structured for the analysis at hand. In this notebook for Probability theory, we will just go through some base functions for using R as a tool in computing and simulating quantities that we have studied throughout the semester.
One thing R is good for in probability theory is serving as a very powerful calculator that (1) has access to the special functions we need for probability theory and (2) lets us write and check long equations without breaking them into various error-prone sub-steps. For example, we can use the choose function to get binomial coefficients. So, what's the probability that a sequence of five dice rolls will contain exactly two 1's? The probability of a 1 is 1/6, so the probability can be written, as you hopefully now know, as:
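One way to write the calculation in R (a sketch using the choose function) is:

choose(5, 2) * (1/6)^2 * (5/6)^3   # binomial coefficient times the success and failure probabilities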
## [1] 0.160751
R also has handy functions such as factorial and lfactorial (log factorial) that are helpful for computing the kinds of counting questions we did in the first two units of the semester. It even has an implementation of the Gamma function and constants such as pi. Here, for example, we see that Gamma(1/2) is (up to rounding error) equal to the square root of pi:
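One way to check this (written here as a sketch) is to subtract the two quantities and look at the size of the difference:

gamma(1/2) - sqrt(pi)   # would be exactly zero in exact arithmetic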
## [1] 2.220446e-16
R outputs the result in scientific notation, and the difference is something very close to zero (if you are taking the numerical analysis class next semester, you'll study why this is not exactly zero and how small a difference is small enough not to worry about).
In addition to basic mathematical operations, R also has specialized functions for working with the standard probability distributions that we have studied throughout the semester. Each distribution has four functions, distinguished by the first letter of the function name: (1) the pdf/pmf, given by a function starting with d (for example, dnorm), (2) the cdf, given by a function starting with p (think: probability; for example, pnorm), (3) the inverse cdf, given by a function starting with q (stands for quantile; for example, qnorm), and (4) a function that produces random draws from the distribution, starting with r (for example, rnorm).
Let's see some examples. The dice question above that we computed is equivalent to the density of the binomial distribution with p = 1/6 and n = 5 at the value x = 2. We can compute this directly with the following:
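One reasonable way to write the call (the argument names are spelled out here for clarity) is:

dbinom(2, size = 5, prob = 1/6)   # P(exactly two 1's in five rolls)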
## [1] 0.160751
This is exactly the same result we had before. For the cdf, let's compute the probability that a standard normal is less than 2. Note that the R functions use the square root of the variance (the standard deviation) as an input.
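A pnorm call along these lines gives the value below:

pnorm(2, mean = 0, sd = 1)   # P(Z < 2) for a standard normal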
## [1] 0.9772499
Similarly, here is the code to find the cut-off value that we used in Handout 19. We are looking for the value of the inverse cdf that corresponds (due to symmetry) to 1 minus 0.025.
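A qnorm call of roughly this form produces the cut-off shown below:

qnorm(1 - 0.025)   # upper 2.5% cut-off of the standard normal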
## [1] 1.959964
You can change the cut-off probability of 0.05 to any other value to find the corresponding critical value. Finally, and perhaps most importantly for the rest of this notebook, we can use the random-draw version of the functions to generate a set of random samples from a given distribution. Here are 20 random samples from a standard normal distribution:
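They can be generated with a call such as this (since the draws are random, re-running the code without setting a seed will give different values than those below):

rnorm(20, mean = 0, sd = 1)   # twenty independent standard normal draws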
## [1] 0.96218902 -0.16938049 -0.34585988 2.00298448 -0.05596546 0.73990382
## [7] -0.63530110 0.84586510 0.43550096 -1.49850883 -0.42702047 0.52745957
## [13] 0.39949524 -0.28949487 1.18324418 -1.55212642 0.22813656 -1.02341080
## [19] 1.74146000 0.99899406
We can save the output of a random sample and then plot the histogram to visually compare it to a theoretical density function. Here is the code to verify that the distribution of values from rnorm matches the density from dnorm:
x <- rnorm(1e7, mean = 0, sd = 1)                                   # ten million standard normal draws
hist(x, breaks = 50, probability = TRUE, col = "white")             # histogram on the density scale
curve(dnorm, from = min(x), to = max(x), add = TRUE, col = "red")   # overlay the theoretical density
We see that the histogram of simulated values lines up fairly well with the theoretical density. Let's see now how this can be useful in some simulations.
Now, let’s see how we can use R to simulate the relationships between different random variables. We were able to derive some of these theoretically. The simulations help to verify that our calculations are correct and to find distributions that are too difficult to get analytically.
Often, we need to run a simulation in two different directions. That is, we need to draw a sample of a given size (what we have usually called n in our theoretical work) a large number of times. A good way to do this in R is to create a matrix with N rows and n columns. Then, we can compute summary statistics across the rows. For example, below we simulate drawing ten uniform random variables and then computing the smallest value from each trial.
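A sketch of that simulation might look like the following (the number of trials, N, is an illustrative choice):

N <- 1e6                                  # number of simulated trials (illustrative value)
mat <- matrix(runif(N * 10), ncol = 10)   # N rows, each a sample of ten uniforms on [0, 1]
y <- apply(mat, 1, min)                   # smallest value in each trial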
We know from last time that this should have a Beta(1, 10) distribution. We can test this by drawing the histogram along with the density function of the Beta distribution.
hist(y, breaks = 50, probability = TRUE, col = "white")
curve(dbeta(x, 1, 10),
from = min(y), to = max(y), add = TRUE, col = "red")
Let's see another similar result, one that we did not derive (it involves the two-dimensional change of variables formula). Consider two independent gamma distributions with the same beta parameter (called rate in R), and look at the proportion of x to the sum of x and y. This is interesting because the Gamma distribution is used to model wait times. A classic example is to think of running two errands: going to the post office and going to the bank. What proportion of the time do you spend at the post office if the first wait is distributed Gamma(2, 1) and the second is distributed Gamma(3, 1)?
It turns out that the proportion will have a beta distribution with alpha equal to 2 (the shape parameter of x) and beta equal to 3 (the shape parameter of y).
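A simulation along these lines (reusing the N from the sketch above) generates the proportions w:

x <- rgamma(N, shape = 2, rate = 1)   # wait time at the post office
y <- rgamma(N, shape = 3, rate = 1)   # wait time at the bank
w <- x / (x + y)                      # proportion of the total time spent at the post office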
hist(w, breaks = 50, probability = TRUE, col = "white")
curve(dbeta(x, 2, 3),
from = min(w), to = max(w), add = TRUE, col = "red")
Let's do one more that we have already seen. Take two independent standard normal random variables and add their squares together.
We know the distribution should be a chi-squared with two degrees of freedom. Does this hold on the simulated data?
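A one-line simulation such as this (again reusing N) produces the values x:

x <- rnorm(N)^2 + rnorm(N)^2   # sum of two squared standard normals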
hist(x, breaks = 50, probability = TRUE, col = "white")
curve(dchisq(x, df = 2),
from = min(x), to = max(x), add = TRUE, col = "red")
Thankfully, yes, it does.
Let's finish with an application of probability theory to statistical inference. We will read into R a dataset showing the average times that different mammals sleep (it's a relatively well-known example dataset in statistical analysis). We then extract, as two separate vectors, the average hours of sleep for two different orders of species: Rodentia and Primates.
df <- readr::read_csv("msleep.csv")
x <- df$sleep_total[df$order == "Rodentia"]
y <- df$sleep_total[df$order == "Primates"]
We want to determine if there is a real difference in the amount of sleep based on the order of the species, or if any difference might just be due to the random selection of species that are included in the dataset. The actual difference in means is just under 2 hours more sleep for the rodents:
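It can be computed directly from the two vectors, for example:

mean(x) - mean(y)   # average sleep for Rodentia minus average sleep for Primates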
## [1] 1.968182
But how can we determine whether this difference could just be due to random chance? In other words, is this a statistically significant result or not? One way to do this, which we will go into detail on in the first few weeks of 330, is to show that the difference in means will (up to a scale factor) have a Student's t-distribution if there is no true difference. We can then use that to get an approximate measure of significance using the central limit theorem.
Another approach, which has gained in popularity over the past few decades, is called the bootstrap. It is based entirely on a probabilistic simulation. Consider taking all of the sleep times for both groups of mammals, which we will call z. Then, generate random versions of the two sets x and y by sampling (with replacement) from the set of total sleep times.
z <- c(x, y)
x_rand <- sample(z, size = length(x), replace = TRUE)
y_rand <- sample(z, size = length(y), replace = TRUE)
d_rand <- mean(x_rand) - mean(y_rand)
d_rand
## [1] 0.2787879
The value above shows the difference between the means that comes from a random assortment of the two labels Rodentia and Primates. If we do this a large number of times, we get an approximation of the null distribution, under the assumption that there is no relationship between sleep times and the order of the mammal.
N <- 1e6                 # number of bootstrap replications
d_rand <- rep(0, N)      # storage for the simulated differences in means
for (j in seq_along(d_rand)) {
  x_rand <- sample(z, size = length(x), replace = TRUE)
  y_rand <- sample(z, size = length(y), replace = TRUE)
  d_rand[j] <- mean(x_rand) - mean(y_rand)
}
We can look at the distribution of the difference, which has an approximately normal distribution.
hist(d_rand, breaks = 50, probability = TRUE, col = "white")
curve(dnorm(x, mean = 0, sd = sd(d_rand)),
from = min(d_rand), to = max(d_rand), add = TRUE, col = "red")
We can ask, how often does a difference as large as the actual difference arise from a random assignment of the labels? This is the value that is called the bootstrap p-value.
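One way to compute it (shown here as a two-sided comparison; the exact convention is a matter of choice) is to take the share of the simulated differences that are at least as extreme as the observed one:

d_obs <- mean(x) - mean(y)        # observed difference in means
mean(abs(d_rand) >= abs(d_obs))   # share of random differences at least as extreme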
## [1] 0.042767
So, a random difference would be as large as the observed difference only around 4% of the time. Traditionally, we say that a result is statistically significant if this probability is less than 0.05, so this result would be significant based on the bootstrap method. We will look at these concepts and the code in the 330 class next semester.