This is the first time we have actually used R this semester. In case this is your first time with it, I’ve tried to keep the notebook short so that you can take your time and ask a lot of questions about what’s going on. The text here on a white background is just plain text. You can write (almost) anything you want here and it should not cause any trouble. The actual R code is written in the parts of the document shown on a grey background. We can run this code and see the results. For example, the code below takes the square root of the number 100. You can run it by hitting the green triangle in the upper-right corner of the grey box.
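That chunk contains a single call to the `sqrt` function:

sqrt(100)   # take the square root of 100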
## [1] 10
The output should show up below the code when you run it. Wondering what the number one in square brackets is? It’s just counting the outputs (this is the “1st” and only output). We can also use R to create variables by assigning objects to names using the arrow operator `<-`. (Note: the back quotes that you see here always indicate that the thing inside them is a command that you can run in R.) For example, here is the code to create a variable called `temp` with the value of 2.4.
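The chunk is a one-line assignment:

temp <- 2.4   # create a variable named temp holding the value 2.4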
You could now, after running that code, use the variable `temp` in any other code that you write. R has a large number of functions that we can use to manipulate data. These include everything from simple mathematical functions, such as the `sqrt` you saw above, to complex statistical models, which we will see later in this document.
I won’t give a complete introduction to the entire R language here, but let’s see one more important thing: the pipe operator. This is the symbol `|>` (you may also see it written equivalently as `%>%`). It passes the output of one line of code to the next line. So, for example, this code takes the square root of `temp` and then applies the `sin` function to the output (for no reason other than just to show the idea of the pipe):
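With the pipe, the chunk reads like this:

temp |>
  sqrt() |>
  sin()   # sqrt of temp, then sin of the result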
## [1] 0.9997667
For every notebook in this class, you should run all of the code that I have given you in code chunks that look like those above. You will also see some chunks like the following that are empty and have a question number at the top. For these, you need to type in your own code to answer the preceding question and then run it. So, let’s start. Below, apply the three following functions, in order, to the number stored in `temp`: (1) the `log`, (2) the square root, and (3) the `tan` function.
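One possible answer, matching the output printed below, chains the three functions with the pipe:

temp |>
  log() |>
  sqrt() |>
  tan()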
## [1] 1.356845
Again, no reason to do this other than to practice.
Let’s do something a lot more interesting. Run the following code to read in a cleaned version of the keylogging data that we created during the first week of the semester. There is a little extra code to deal with some missing data. NOTE: If you get an error, go to the top of the notebook and run the very first chunk to load in all the R packages that we need.
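As a rough sketch of what that chunk does (the file name `keylog.csv` is a placeholder, not the actual file used, and the missing-data handling is omitted here):

library(tidyverse)               # loaded by the first chunk in the notebook
kl <- read_csv("keylog.csv")     # placeholder file name
# (the real chunk also includes a line or two to deal with missing values)
kl                               # print the cleaned dataset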
## # A tibble: 19,893 × 7
## id task time duration key gap1 gap2
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 A 1 226 96 "F" 359 263
## 2 A 1 585 121 "a" 244 123
## 3 A 1 829 64 "r" 140 76
## 4 A 1 969 73 " " 322 249
## 5 A 1 1291 76 "o" 211 135
## 6 A 1 1502 64 "u" 226 162
## 7 A 1 1728 64 "t" 213 149
## 8 A 1 1941 80 " " 470 390
## 9 A 1 2411 61 "i" 153 92
## 10 A 1 2564 33 "n" 115 82
## # ℹ 19,883 more rows
The dataset here consists of each key that was typed by each student. The column id is an anonymized code for each student; task tells us whether this is the copy task (1) or the free-write task (2); time is a timestamp in milliseconds from the time the window was opened; duration is the amount of time the key was held down, in milliseconds; key is the output of hitting the key; gap1 is the time from this key being pressed to the time the next key is pressed; and gap2 is the time from this key being released to the next key being pressed.
Take a few moments to scroll through the data and note that you can see (more or less) the text by reading down the key column. I have removed backspaces and other special characters, so it will not be a perfect match.
We can use the `filter` function to take a subset of our dataset. Here, for example, is the data from the second task for student “A”.
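The chunk that produces this subset is presumably along these lines:

kl |>
  filter(id == "A", task == 2)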
## # A tibble: 186 × 7
## id task time duration key gap1 gap2
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 A 2 107 69 "M" 189 120
## 2 A 2 296 36 "y" 4 -32
## 3 A 2 300 28 "u" 139 111
## 4 A 2 439 76 " " 1262 1186
## 5 A 2 1701 64 " " 508 444
## 6 A 2 2209 88 "f" 198 110
## 7 A 2 2407 111 "a" 262 151
## 8 A 2 2669 57 "v" 206 149
## 9 A 2 2875 76 "i" 802 726
## 10 A 2 3677 69 "o" 122 53
## # ℹ 176 more rows
In the code below, select your own data. Scroll through and make sure that this is the correct data. If not, please let me know!
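Judging from the output below, the chunk was filled in roughly like this (with your own id in place of "B"):

kl |>
  filter(id == "B", task == 2)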
## # A tibble: 522 × 7
## id task time duration key gap1 gap2
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 B 2 68 92 "M" 225 133
## 2 B 2 293 81 "y" 113 32
## 3 B 2 406 100 " " 152 52
## 4 B 2 558 112 "f" 142 30
## 5 B 2 700 108 "a" 184 76
## 6 B 2 884 112 "v" 130 18
## 7 B 2 1014 104 "o" 116 12
## 8 B 2 1130 144 "r" 112 -32
## 9 B 2 1242 76 "i" 110 34
## 10 B 2 1352 136 "t" 195 59
## # ℹ 512 more rows
The following code constructs a histogram of the duration of each of the keys across all of the students for the first task. Take a moment to understand what the output is telling us.
kl |>
filter(task == 1) |>
ggplot(aes(x = duration)) +
geom_histogram(color = "black", fill = "white", bins = 30)
Copy the code above into the chunk below and then change it to show a histogram of the variable `gap1`. You should notice that this is quite different.
# Question 03
kl |>
filter(task == 1) |>
ggplot(aes(x = gap1)) +
geom_histogram(color = "black", fill = "white", bins = 30)
The issue with the histogram is that there are a few gaps that are very long. These are probably someone taking a break, checking their work, or making changes somewhere. In the code below, add an extra filter to only include those data where gap1 is less than 1 second (that is, 1000 milliseconds).
# Question 04
kl |>
filter(task == 1) |>
filter(gap1 < 1000) |>
ggplot(aes(x = gap1)) +
geom_histogram(color = "black", fill = "white", bins = 30)
Finally, do the same for the second gap:
# Question 05
kl |>
filter(task == 1) |>
filter(gap2 < 1000) |>
ggplot(aes(x = gap2)) +
geom_histogram(color = "black", fill = "white", bins = 30)
Do you think these are normal enough to apply the T-based confidence intervals? Hopefully so!
Let’s compute the sample mean of the duration of all the key presses across the dataset. R provides the function `mean` to compute a sample mean. In order to get the data from one column of the dataset, we can use the format DATANAME$VARNAME: in other words, write the name of the dataset, a dollar sign, and then a variable name. In the code below, compute the sample mean of the duration of the key presses:
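The answer is a single call to `mean` with the dollar-sign syntax:

mean(kl$duration)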
## [1] 96.10169
Make sure that you understand what this means in terms of seconds.
Now, let’s try to build a 99% confidence interval for the duration. We can get the critical value (the t-sub-alpha in the notes) for the interval with the following code, which uses the quantile function `qt` for the T distribution and the `nrow` function to get the sample size of our data (very large, about 20k).
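The chunk is presumably something like the following (the exact degrees-of-freedom convention may differ by one, which makes no practical difference at this sample size):

qt(1 - 0.01 / 2, df = nrow(kl) - 1)   # 0.995 quantile of the T distribution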
## [1] 2.576076
Using our formula for the T-based confidence interval, use R to compute the lower bound of the confidence interval. You can use the function `var` to compute the sample variance.
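A sketch of the lower bound, written directly from the formula (the mean minus the critical value times the standard error); small differences in rounding or the degrees-of-freedom convention can shift the last decimal place:

mean(kl$duration) -
  qt(0.995, df = nrow(kl) - 1) * sqrt(var(kl$duration) / nrow(kl))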
## [1] 95.46117
And then, below, use a similar approach to get the upper bound.
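And the matching sketch for the upper bound:

mean(kl$duration) +
  qt(0.995, df = nrow(kl) - 1) * sqrt(var(kl$duration) / nrow(kl))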
## [1] 96.74222
We can get R to do all of this work for us using the function `t.test`. Here is an example of how to use the function to construct a confidence interval for the mean duration of the key presses, using a 99% confidence level.
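The call looks like this; `conf.level` sets the confidence level:

t.test(kl$duration, conf.level = 0.99)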
##
## One Sample t-test
##
## data: kl$duration
## t = 386.52, df = 19892, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 95.46119 96.74220
## sample estimates:
## mean of x
## 96.10169
There is a lot of information in the output. We’ll learn about the parts at the top next week. For now, just look at the confidence interval and mean at the bottom. Do they match what you computed manually above?
Use the `t.test` function to get a 99% confidence interval for `gap1`, the time between hitting subsequent keys.
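One possible answer, matching the output below:

t.test(kl$gap1, conf.level = 0.99)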
##
## One Sample t-test
##
## data: kl$gap1
## t = 26.646, df = 19858, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 253.4688 307.7234
## sample estimates:
## mean of x
## 280.5961
There is a problem with this data, however. We saw above that the gap data are heavily skewed. This makes it inappropriate for the T distribution as-is. We can fix this by filtering the data to include only those gaps that are under 1 second (the cut-off is a bit arbitrary, but 1s works well). We can do this in R by adding `[kl$gap1 < 1000]` to the end of the variable name. In the code below, compute the confidence interval for `gap1` using only those values that are under 1s.
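One way to write this, using the bracket filter described above:

t.test(kl$gap1[kl$gap1 < 1000], conf.level = 0.99)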
##
## One Sample t-test
##
## data: kl$gap1[kl$gap1 < 1000]
## t = 163.74, df = 19206, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 167.5610 172.9177
## sample estimates:
## mean of x
## 170.2393
Compare this to the first output. Does it seem reasonable to filter the data like this?
We also have a formula for computing a confidence interval for the sample variance. Below, compute the sample variance of the duration:
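A single call to `var` does it:

var(kl$duration)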
## [1] 1229.786
Here is the code to get the chi-squared critical value used for the lower bound of the confidence interval:
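Presumably something like the following, using R’s chi-squared quantile function `qchisq`:

qchisq(1 - 0.01 / 2, df = nrow(kl) - 1)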
## [1] 20410.54
And here is the code for the critical value used for the upper bound:
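And for the other tail:

qchisq(0.01 / 2, df = nrow(kl) - 1)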
## [1] 19382.97
Below, compute the lower-bound of the 99% confidence interval for the variance of the duration:
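A sketch straight from the formula: (n − 1) times the sample variance, divided by the chi-squared critical value from above:

(nrow(kl) - 1) * var(kl$duration) / qchisq(0.995, df = nrow(kl) - 1)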
## [1] 1198.543
And then, the upper bound:
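And the same with the other critical value:

(nrow(kl) - 1) * var(kl$duration) / qchisq(0.005, df = nrow(kl) - 1)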
## [1] 1262.082
Notice that the variance is in squared milliseconds. Typically, in an applied problem, it makes more sense to convert these estimates into a confidence interval for the square root of the variance (called the standard deviation). Compute the bounds on the standard deviation below:
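One way is to take the square root of the two variance bounds in a single vectorized call:

# square roots of the lower and upper variance bounds computed above
sqrt((nrow(kl) - 1) * var(kl$duration) / qchisq(c(0.995, 0.005), df = nrow(kl) - 1))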
## [1] 34.61997 35.52580
Go back and look at the histogram of the duration. Do these bounds seem reasonable for the standard deviation? A good rule of thumb is that about 95% of the data should be within +/- two standard deviations of the mean.
Hopefully you found it insightful to work with some real data. Unfortunately, it is rarely interesting to consider means on their own (at least unless you have a lot of prior knowledge about the data you are working with). What will be much easier to motivate is studying differences in durations. For example, do you take more (or less) time to type when you are copying versus writing your own text? Are there differences between the people in the class in the speed at which they type? Are there letter pairs that are harder or easier to type? We will build all of the tools to do these things over the next few weeks.