Let’s read the keylogging data into R again. We will continue to use this dataset over the next few weeks, as it has a surprisingly large number of different statistical applications.
library(readr)   # read_csv
library(dplyr)   # filter
kl <- read_csv("../data/keylog.csv")
kl$key[is.na(kl$key)] <- " "                    # recode missing keys as the space character
kl <- filter(kl, !is.na(gap1), !is.na(gap2))    # drop rows with missing timing gaps
kl
## # A tibble: 19,859 × 7
## id task time duration key gap1 gap2
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 A 1 226 96 "F" 359 263
## 2 A 1 585 121 "a" 244 123
## 3 A 1 829 64 "r" 140 76
## 4 A 1 969 73 " " 322 249
## 5 A 1 1291 76 "o" 211 135
## 6 A 1 1502 64 "u" 226 162
## 7 A 1 1728 64 "t" 213 149
## 8 A 1 1941 80 " " 470 390
## 9 A 1 2411 61 "i" 153 92
## 10 A 1 2564 33 "n" 115 82
## # ℹ 19,849 more rows
Today, we will look at two different samples: the amount of time after typing the space bar until the next key is typed, and the amount of time between typing any non-space key and the next key. We will pull these out as two variables called x and y to match the notes; we also remove long gaps, as per the discussion last time, and restrict ourselves to only the first copy task.
x <- kl$gap2[kl$task == 1 & kl$key == " " & kl$gap2 < 1000]
y <- kl$gap2[kl$task == 1 & kl$key != " " & kl$gap2 < 1000]
We will also save the variables n and m to match those from the notes:
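The notes themselves are not reproduced here, but based on how n and m are used below (for example, in the degrees of freedom), they are the sample sizes of the two gap variables:

```r
n <- length(x)   # number of gaps following the space bar
m <- length(y)   # number of gaps following a non-space key
```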
In the code below, compute the best guess point estimator for the difference in the means of these two gaps:
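One possible answer: the natural point estimator for the difference in the means is the difference of the sample means.

```r
# best-guess point estimate of the difference in mean gaps
mean(x) - mean(y)
```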
## [1] 77.33267
Now, let’s build a two-sample confidence interval for the difference in means. To start, get the critical value t_alpha, which we call tval in R, for the appropriate degrees of freedom derived on the worksheet for a 95% confidence interval:
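A sketch of one solution, assuming the pooled degrees of freedom n + m - 2 from the worksheet:

```r
# two-sided 95% critical value from the t-distribution
tval <- qt(1 - 0.05/2, df = n + m - 2)
tval
```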
## [1] 1.960083
Next, compute the pooled variance in the code below, saving the output as a variable called sp.
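One possible answer, using the standard pooled-variance formula (a weighted average of the two sample variances):

```r
# pooled variance: weighted average of the two sample variances
sp <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
sp
```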
## [1] 23743.95
Finally, compute the confidence interval in the code below by applying the formula you constructed:
# Question 04
c((mean(x) - mean(y)) - tval * sqrt(sp * (1/n + 1/m)),
(mean(x) - mean(y)) + tval * sqrt(sp * (1/n + 1/m)))
## [1] 70.22590 84.43943
Now, the easy part. Run the following code to apply the result you have above with the built-in function in R:
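The built-in function here is t.test; setting var.equal = TRUE reproduces the pooled interval computed above:

```r
# pooled two-sample t-test, assuming equal variances
t.test(x, y, var.equal = TRUE)
```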
##
## Two Sample t-test
##
## data: x and y
## t = 21.329, df = 12988, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 70.22567 84.43966
## sample estimates:
## mean of x mean of y
## 141.32749 63.99483
Note that the R function has an option that we set to indicate that we want to assume that the variances are equal. Repeat the procedure below, but set var.equal to FALSE. This runs a more complex algorithm that is (approximately) valid in the case that the samples have different variances. Note how it affects the results.
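One way to run this, with the same call as before but var.equal set to FALSE (this is in fact the default for t.test):

```r
# Welch's t-test: does not assume equal variances
t.test(x, y, var.equal = FALSE)
```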
##
## Welch Two Sample t-test
##
## data: x and y
## t = 17.147, df = 2629.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 68.48924 86.17609
## sample estimates:
## mean of x mean of y
## 141.32749 63.99483
You should see that the confidence interval gets a bit larger.
Finally, let’s compute the confidence interval for the ratio of the variances. In the interest of time, I have typed out the code below, which applies the equation you should have found on the worksheet:
fval1 <- qf(1 - 0.05/2, df1 = n - 1, df2 = m - 1)  # upper-tail F critical value
fval2 <- qf(0.05/2, df1 = n - 1, df2 = m - 1)      # lower-tail F critical value
c(var(x) / var(y) * fval2, var(x) / var(y) * fval1)
## [1] 1.827147 2.081836
Below, apply the function var.test to the two samples (no other inputs required) in order to generate a 95% confidence interval. There are a number of different outputs; find the confidence interval and verify that it closely matches the code above.
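One possible call; var.test computes the F test for the ratio of the two variances and reports a 95% confidence interval by default:

```r
# F test comparing the two sample variances
var.test(x, y)
```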
##
## F test to compare two variances
##
## data: x and y
## F = 1.9517, num df = 2167, denom df = 10821, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.829776 2.084831
## sample estimates:
## ratio of variances
## 1.951741