Let’s read the keylogging data into R again. We will continue to use this dataset over the next few weeks, as it has a surprisingly large number of different statistical applications.
library(readr)   # read_csv
library(dplyr)   # filter
kl <- read_csv("../data/keylog.csv")
kl$key[is.na(kl$key)] <- " "                    # recode missing keys as the space character
kl <- filter(kl, !is.na(gap1), !is.na(gap2))    # drop rows with missing timing gaps
kl
## # A tibble: 19,859 × 7
## id task time duration key gap1 gap2
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 A 1 226 96 "F" 359 263
## 2 A 1 585 121 "a" 244 123
## 3 A 1 829 64 "r" 140 76
## 4 A 1 969 73 " " 322 249
## 5 A 1 1291 76 "o" 211 135
## 6 A 1 1502 64 "u" 226 162
## 7 A 1 1728 64 "t" 213 149
## 8 A 1 1941 80 " " 470 390
## 9 A 1 2411 61 "i" 153 92
## 10 A 1 2564 33 "n" 115 82
## # ℹ 19,849 more rows
Today, we will look at two different samples: the amount of time after typing the space bar until the next key is typed, and the amount of time between typing any non-space key and the next key. We will pull these out as two variables called x and y to match the notes; we also remove long gaps, as per the discussion last time, and restrict ourselves to only the first copy task.
x <- kl$gap2[kl$task == 1 & kl$key == " " & kl$gap2 < 1000]
y <- kl$gap2[kl$task == 1 & kl$key != " " & kl$gap2 < 1000]
We will also save the variables n and m to match those from the notes:
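The notes themselves are not reproduced here, but based on how n and m are used below (for example, in the degrees of freedom), they are the sample sizes of the two gap variables:

```r
n <- length(x)   # number of gaps following the space bar
m <- length(y)   # number of gaps following a non-space key
```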
In the code below, compute the best guess point estimator for the difference in the means of these two gaps:
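One possible answer: the natural point estimator for the difference in the means is the difference of the sample means.

```r
# best-guess point estimate of the difference in mean gaps
mean(x) - mean(y)
```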
## [1] 77.33267
Now, let’s build a two-sample confidence interval for the difference in means. To start, get the critical value t_alpha, which we call tval in R, for the appropriate degrees of freedom derived on the worksheet for a 95% confidence interval:
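A sketch of one solution, assuming the pooled degrees of freedom n + m - 2 from the worksheet:

```r
# two-sided 95% critical value from the t-distribution
tval <- qt(1 - 0.05/2, df = n + m - 2)
tval
```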
## [1] 1.960083
Next, compute the pooled variance in the code below, saving the output as a variable called sp.
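One possible answer, using the standard pooled-variance formula (a weighted average of the two sample variances):

```r
# pooled variance: weighted average of the two sample variances
sp <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
sp
```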
## [1] 23743.95
Finally, compute the confidence interval in the code below by applying the formula you constructed:
# Question 04
c((mean(x) - mean(y)) - tval * sqrt(sp * (1/n + 1/m)),
(mean(x) - mean(y)) + tval * sqrt(sp * (1/n + 1/m)))
## [1] 70.22590 84.43943
Now, the easy part. Run the following code to apply the result you have above with the built-in function in R:
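The built-in function here is t.test; setting var.equal = TRUE reproduces the pooled interval computed above:

```r
# pooled two-sample t-test, assuming equal variances
t.test(x, y, var.equal = TRUE)
```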
##
## Two Sample t-test
##
## data: x and y
## t = 21.329, df = 12988, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 70.22567 84.43966
## sample estimates:
## mean of x mean of y
## 141.32749 63.99483
Note that the R function has an option that we set to indicate that we want to assume that the variances are equal. Repeat the procedure below, but set var.equal to FALSE. This runs a more complex algorithm that is (approximately) valid in the case that the samples have different variances. Note how it affects the results.
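One way to run this, with the same call as before but var.equal set to FALSE (this is in fact the default for t.test):

```r
# Welch's t-test: does not assume equal variances
t.test(x, y, var.equal = FALSE)
```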
##
## Welch Two Sample t-test
##
## data: x and y
## t = 17.147, df = 2629.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 68.48924 86.17609
## sample estimates:
## mean of x mean of y
## 141.32749 63.99483
You should see that the confidence interval gets a bit larger.
Finally, let’s compute the confidence interval for the ratio of the variances. In the interest of time, I have typed out the code below, which applies the equation you should have found on the worksheet:
fval1 <- qf(1 - 0.05/2, df1 = n - 1, df2 = m - 1)  # upper-tail F critical value
fval2 <- qf(0.05/2, df1 = n - 1, df2 = m - 1)      # lower-tail F critical value
c(var(x) / var(y) * fval2, var(x) / var(y) * fval1)
## [1] 1.827147 2.081836
Below, apply the function var.test to the two samples (no other inputs required) in order to generate a 95% confidence interval. There are a number of different outputs; find the confidence interval and verify that it closely matches the code above.
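One possible call; var.test computes the F test for the ratio of the two variances and reports a 95% confidence interval by default:

```r
# F test comparing the two sample variances
var.test(x, y)
```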
##
## F test to compare two variances
##
## data: x and y
## F = 1.9517, num df = 2167, denom df = 10821, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.829776 2.084831
## sample estimates:
## ratio of variances
## 1.951741