Key Logs (Individual Keys)

Let’s read the word-level keylog data into R.

word <- read_csv("../data/keylog_word.csv")
word
## # A tibble: 3,329 × 11
##    id     task word_id word  nchar start   end duration gap_before gap_after
##    <chr> <dbl>   <dbl> <chr> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
##  1 A         1       1 out       3  1291  1728      437        462       683
##  2 A         1       2 in        2  2411  2564      153        683       286
##  3 A         1       3 the       3  2850  3141      291        286       423
##  4 A         1       4 unch…    10  3564  6074     2510        423       371
##  5 A         1       5 back…    10  6445  7970     1525        371       305
##  6 A         1       6 of        2  8275  8436      161        305       266
##  7 A         1       7 the       3  8702  8993      291        266       676
##  8 A         1       8 nunf…    14  9669 15103     5434        676       717
##  9 A         1       9 end       3 15820 16029      209        717       319
## 10 A         1      10 of        2 16348 16479      131        319       255
## # ℹ 3,319 more rows
## # ℹ 1 more variable: type <chr>

We are going to test the hypothesis that all students have the same average gap between finishing one word and starting the next on the 2nd task. The code below keeps only the rows from task 2 and drops any gap of 2000 or more. We can grab the data for this using the following R code:

index <- (word$task == 2 & word$gap_after < 2000)
x <- word$gap_after[index]
block <- word$id[index]
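
To see how the logical index above behaves, here is a small sketch on made-up vectors (`task_toy` and `gap_toy` are hypothetical, not columns of the keylog file). It also shows what would happen if a gap were missing; whether the real data contains any missing gaps is not something shown here.

```r
# Made-up values for illustration only
task_toy <- c(1, 2, 2, 2)
gap_toy  <- c(300, 450, 2500, NA)

# TRUE only where both conditions hold; NA where the comparison is undefined
keep <- (task_toy == 2 & gap_toy < 2000)
keep                    # FALSE TRUE FALSE NA

# Indexing with a logical vector keeps the TRUE positions; an NA in the
# index produces an NA in the result
gap_toy[keep]           # 450 NA

# Wrapping the index in which() keeps only the positions that are TRUE
gap_toy[which(keep)]    # 450
```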

Then, all of the derived variables can be computed with the following:

xbar <- tapply(x, block, mean)
xbar_all <- mean(x)
s2 <- tapply(x, block, var)
n <- tapply(x, block, length)
K <- length(unique(block))
N <- sum(n)
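
If `tapply` is unfamiliar, the following sketch shows what it returns on a tiny made-up example (`x_toy` and `block_toy` are hypothetical names, not part of the keylog data). It applies a function to the values of the first argument separately within each group defined by the second.

```r
# Made-up values for illustration only
x_toy     <- c(100, 200, 300, 400, 600)
block_toy <- c("A", "A", "B", "B", "B")

# One summary value per group, named by the group label
xbar_toy <- tapply(x_toy, block_toy, mean)     # A: 150, B: 433.33
n_toy    <- tapply(x_toy, block_toy, length)   # A: 2,   B: 3

# Number of groups and total number of observations
K_toy <- length(unique(block_toy))             # 2
N_toy <- sum(n_toy)                            # 5
```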

R has a nice syntax where we can do vectorized operations. So, if we add/multiply two vectors of the same length, it will do these operations component-wise. If we add/multiply a constant with a vector, it will apply the constant to every entry. For example, here is the denominator of the F-statistic:

sum((n - 1) * s2) / (N - K)
## [1] 122995.4
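
As a minimal illustration of the same vectorized behavior, here are a few operations on made-up vectors (`a` and `b` are hypothetical and unrelated to the keylog data):

```r
# Toy vectors for illustration only
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b       # component-wise sum:      11 22 33
a * b       # component-wise product:  10 40 90
a * 2       # constant applied to every entry: 2 4 6
(a - 1)^2   # operations chain the same way:   0 1 4

# Combining these with sum() gives a single number, as in the denominator above
sum((a - 1) * b)   # 0*10 + 1*20 + 2*30 = 80
```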

Below, write the R code to create the F-statistic from the formula you derived on today’s worksheet. Save the result as an object named fstat:

# Question 01
fstat <- (sum(n*(xbar - xbar_all)^2) / (K - 1)) / (sum((n - 1) * s2) / (N - K))
fstat
## [1] 7.004472
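
For reference, the code above corresponds to the following formula, written with the variable names defined earlier. The subscript k, which indexes the K groups, is notation assumed here; the worksheet may use different symbols.

```latex
F = \frac{\left. \sum_{k=1}^{K} n_k \left( \bar{x}_k - \bar{x}_{\text{all}} \right)^2 \right/ (K - 1)}
         {\left. \sum_{k=1}^{K} (n_k - 1)\, s_k^2 \right/ (N - K)}
```

Here n_k, \bar{x}_k, and s_k^2 are the entries of `n`, `xbar`, and `s2`, and \bar{x}_{\text{all}} is `xbar_all`.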

Use the following code to compute the p-value of the F-statistic. Is the test significant at the 0.001 level?

1 - pf(fstat, df1 = (K - 1), df2 = (N - K))
## [1] 1.554312e-15
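
The `1 - pf(...)` form subtracts a number very close to 1 from 1, which can lose precision when the p-value is this small. A possible alternative, sketched below, asks `pf` for the upper tail directly with `lower.tail = FALSE`. To keep the sketch self-contained, the values of `fstat`, `K`, and `N` are copied from the output on this page (K = 17 and N = 1050 are inferred from the degrees of freedom 16 and 1033) rather than recomputed from the data.

```r
# Values copied from the results on this page, not recomputed
fstat <- 7.004472
K <- 17
N <- 1050

# Subtraction form: limited by floating-point spacing near 1
p_subtract <- 1 - pf(fstat, df1 = K - 1, df2 = N - K)

# Upper-tail form: computes the small tail probability directly
p_upper <- pf(fstat, df1 = K - 1, df2 = N - K, lower.tail = FALSE)

p_subtract
p_upper
```

Both values lead to the same conclusion at the 0.001 level; they should differ only in the later digits.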

As with the other tests we have used, there are built-in R functions to do all of this work for us. Here is the code to run the analysis of variance:

summary(aov(x ~ block))
##               Df    Sum Sq Mean Sq F value   Pr(>F)    
## block         16  13784290  861518   7.004 1.58e-15 ***
## Residuals   1033 127054281  122995                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
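
If you want the numbers rather than the printed table, the summary can be indexed like a data frame. The sketch below uses simulated data (`x_sim` and `block_sim` are hypothetical stand-ins for `x` and `block`) so that it runs on its own, and checks that the hand-computed F-statistic matches the one reported by `aov`.

```r
# Simulated data for illustration only: three groups of unequal size
set.seed(1)
block_sim <- rep(c("A", "B", "C"), times = c(20, 25, 30))
x_sim <- rnorm(length(block_sim), mean = 300, sd = 50)

# F-statistic by hand, using the same steps as earlier in this section
n_sim    <- tapply(x_sim, block_sim, length)
xbar_sim <- tapply(x_sim, block_sim, mean)
s2_sim   <- tapply(x_sim, block_sim, var)
K_sim <- length(n_sim)
N_sim <- sum(n_sim)
f_manual <- (sum(n_sim * (xbar_sim - mean(x_sim))^2) / (K_sim - 1)) /
  (sum((n_sim - 1) * s2_sim) / (N_sim - K_sim))

# The first element of the summary is the ANOVA table; the first row
# is the grouping variable and the second row is the residuals
tab <- summary(aov(x_sim ~ block_sim))[[1]]
f_aov <- tab[["F value"]][1]
p_aov <- tab[["Pr(>F)"]][1]

all.equal(f_manual, f_aov)   # TRUE if the two computations agree
```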

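There is a lot of information in the output, some of which we do not need. You should see, though, that it includes the degrees of freedom, the F-statistic, and the p-value. [While very close, you’ll probably notice that the p-value is a little different from the one we computed by hand above (at least on my machine). This is numerical instability in the p-value computation, most likely because the hand computation subtracts a number extremely close to 1 from 1, which loses precision for p-values this small.]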