Let’s do in R what we did manually and roll a die 160 times.
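This is a single call to sample; something like the following (without the original random seed, the exact rolls will differ):
y <- sample(seq_len(6), size = 160, replace = TRUE)
y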
## [1] 1 4 1 2 5 3 6 2 3 3 1 5 5 2 6 6 2 1 5 5 1 1 6 5 5 2 2 6 1 4 1 4 3 6 2
## [36] 2 6 4 4 4 2 4 1 6 1 4 1 6 2 3 2 6 6 2 5 2 6 6 6 1 3 3 6 4 6 3 1 4 5 1
## [71] 1 6 4 5 5 4 6 5 4 4 1 5 5 6 1 1 3 6 2 2 3 6 2 4 3 5 2 2 1 3 3 2 2 5 2
## [106] 5 4 5 4 6 1 3 2 3 3 1 5 6 6 6 4 4 1 5 5 6 1 3 6 3 6 3 3 4 1 1 4 2 6 1
## [141] 2 3 4 1 3 5 3 4 2 1 4 1 4 2 5 2 2 2 3 1
We can turn this into a set of counts, which follow a multinomial distribution.
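With the rolls stored in y as above, the counts are just a table of the values:
x <- as.numeric(table(y))
x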
## [1] 30 29 24 25 23 29
What are the expected counts under the null hypothesis that all of the sides are equally likely? That’s just 1/6 times the number of samples.
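In R, this mirrors the expected counts e used in the simulation loop below:
n <- 160
e <- rep(1/6, 6) * n
e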
## [1] 26.66667 26.66667 26.66667 26.66667 26.66667 26.66667
And the G-score is given by the formula that we derived in the notes.
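With the observed counts x and expected counts e from above, this is the same line that appears inside the simulation loop below:
G <- -2 * sum( x * log(e / x) )
G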
## [1] 1.708677
This should have a chi-squared distribution with 6-1 degrees of freedom (the number of sides of the die, minus one because the last probability is determined by all of the others). So, what is the critical value for rejecting the null hypothesis at a 95% confidence level? (We always use a one-sided test with the log-likelihood ratio.)
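One way to compute it is with the chi-squared quantile function:
qchisq(0.95, df = 6 - 1)   # 95th percentile of a chi-squared with 5 degrees of freedom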
## [1] 11.0705
We see that we would not reject the null hypothesis since the critical value is larger than our test statistic. We could also convert the test statistic into a p-value.
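One way, using the upper tail of the same chi-squared distribution:
1 - pchisq(G, df = 6 - 1)   # probability of a value at least as large as our statistic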
## [1] 0.8878047
And again, we see that it is not significant.
Since it is easy in R, let’s simulate a large number of trials like the one we did manually, recording the G score for each trial.
N <- 10000                    # number of simulated trials
n <- 16 * 10                  # rolls per trial
G <- rep(NA, N)
e <- rep(1/6, 6) * n          # expected counts under the null
for (j in seq_along(G))
{
  y <- sample(seq_len(6), size = n, replace = TRUE)
  x <- as.numeric(table(y))
  G[j] <- -2 * sum( x * log(e / x) )
}
We can look at the distribution and see that it does in fact closely follow a chi-squared distribution with 6-1 degrees of freedom:
hist(G, breaks = 50, probability = TRUE, col = "white")
curve(dchisq(x, df = 6 - 1),
      from = min(G), to = max(G), add = TRUE, col = "red")
How often would we (incorrectly) reject the null hypothesis?
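A sketch of the check, comparing each simulated G-score to the 95% critical value (the exact proportion depends on the simulation seed):
mean(G > qchisq(0.95, df = 6 - 1))   # proportion of trials where we would reject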
## [1] 0.0541
This is almost exactly what we would want (0.05 is ideal), with the difference due partly to the chi-squared approximation and partly to the noise of the simulation.
Now, we will simulate the unfair die that we also rolled manually. Let’s start with a single set of rolls:
y <- sample(seq_len(6), size = n, replace = TRUE)
# re-roll every 4, 5, or 6 once, which biases the counts towards the small values
y[y >= 4] <- sample(seq_len(6), size = sum(y >= 4), replace = TRUE)
table(y)
## y
## 1 2 3 4 5 6
## 40 37 35 19 17 12
The bias towards the small values is fairly noticeable. Let’s compute the G-score:
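something like this, reusing the fair-die expected counts e from above:
x <- as.numeric(table(y))
-2 * sum( x * log(e / x) )   # the G-score for this single set of unfair rolls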
## [1] 28.35575
This has the same null distribution as above, so, comparing against the critical value from above, we can see that the null hypothesis will be rejected. What’s the p-value?
## [1] 3.100986e-05
So, we see that we reject the null hypothesis since the p-value is much smaller than the cut-off value of 0.05 (or even 0.001).
Just as above, we can simulate this for a large number of trials.
N <- 10000
n <- 16 * 10
G <- rep(NA, N)
e <- rep(1/6, 6) * n          # expected counts under the fair-die null
for (j in seq_along(G))
{
  y <- sample(seq_len(6), size = n, replace = TRUE)
  y[y >= 4] <- sample(seq_len(6), size = sum(y >= 4), replace = TRUE)   # unfair die
  x <- as.numeric(table(y))
  G[j] <- -2 * sum( x * log(e / x) )
}
How often do we actually reject the null hypothesis here?
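One way to tabulate this, again comparing to the 95% critical value:
table(G > qchisq(0.95, df = 6 - 1))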
##
## TRUE
## 10000
Always! That’s why I felt confident, when I made you do this in class, that we would get a significant result manually as well. Let’s see how this compares to the chi-squared:
hist(G, breaks = 50, probability = TRUE, col = "white")
curve(dchisq(x, df = 6 - 1),
      from = min(G), to = max(G), add = TRUE, col = "red")
As hoped, we see that the distribution is very different from the chi-squared, giving the test sufficient power to detect deviations from the null hypothesis.
No more die-rolling. But let’s look at one final extension of this technique in R for today. We will generate some data from a Poisson distribution with lambda equal to 2.
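A sketch of the generation, using the same settings as the simulation loop further below (the exact values depend on the random seed):
n <- 200
y <- rpois(n, lambda = 2)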
We will ignore that we know how the data were generated. Let’s look at a table of the values:
## y
## 0 1 2 3 4 5 6
## 25 64 43 43 13 7 5
Let’s say we wanted to do a hypothesis test with the null hypothesis that the data in fact came from a Poisson distribution. If we also knew lambda, that would be easy with the approach above, but we don’t. So, instead, we estimate the value of lambda using the MLE, which for the Poisson is just the sample mean.
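In code, this mirrors the lhat line in the simulation loop below:
lhat <- mean(y)
lhat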
## [1] 1.98
Another difficulty is that the largest values are fairly rare, which leads to very small expected counts. We will use a standard technique of collapsing all of the large values into a collective category (here, I chose the values of 4 and above). The expected probabilities would then be:
ptilde <- dpois(seq(0, 4), lambda = lhat)
# put all of the remaining probability mass, P(Y >= 4), into the final category
ptilde[length(ptilde)] <- 1 - sum(ptilde[-length(ptilde)])
ptilde
## [1] 0.1380692 0.2733771 0.2706433 0.1786246 0.1392858
And the expected counts:
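these are just the estimated probabilities times the sample size, as in the loop below:
e <- ptilde * n
e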
## [1] 27.61385 54.67542 54.12866 35.72492 27.85715
The observed counts, collapsing the largest values into a catch-all final category, are as follows.
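A sketch, using the same collapsing step as in the simulation loop below:
x <- as.numeric(table(y))
x[5] <- sum(x[seq(5, length(x))])   # fold counts for values 4 and above into the fifth entry
x <- x[seq(1, 5)]
x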
## [1] 25 64 43 43 25
Now, we just compute G as before.
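This is the same line as in the simulation loop below:
G <- -2 * sum( x * log(e / x) )
G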
## [1] 5.919403
How many degrees of freedom are there? Well, we have 5 categories, so our parameter space (big Theta) has 5-1 degrees of freedom. This time, however, the null hypothesis has 1 degree of freedom (corresponding to estimating the value of lambda). So, the degrees of freedom of the chi-squared should be (5 - 1) - 1, or 3. Here is the correct computation of the p-value:
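assuming the statistic above is stored in G, this is the upper tail of a chi-squared with 3 degrees of freedom:
1 - pchisq(G, df = (5 - 1) - 1)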
## [1] 0.115598
So, as desired (since we did in fact generate the data from a Poisson distribution), we fail to reject the null hypothesis.
Let’s do what we did above again: simulate this many times.
N <- 10000
n <- 200
G <- rep(NA, N)
for (j in seq_along(G))
{
  y <- rpois(n, lambda = 2)
  lhat <- mean(y)                                  # MLE of lambda
  ptilde <- dpois(seq(0, 4), lambda = lhat)
  ptilde[length(ptilde)] <- 1 - sum(ptilde[-length(ptilde)])
  e <- ptilde * n
  x <- as.numeric(table(y))
  x[5] <- sum(x[seq(5, length(x))])                # collapse counts for values 4 and above
  x <- x[seq(1, 5)]
  G[j] <- -2 * sum( x * log(e / x) )
}
The histogram does generally follow the chi-squared distribution with (5 - 1) - 1 = 3 degrees of freedom:
hist(G, breaks = 50, probability = TRUE, col = "white")
curve(dchisq(x, df = 5 - 1 - 1),
      from = min(G), to = max(G), add = TRUE, col = "red")
And the probability of falsely rejecting the null hypothesis, again, is as expected.
## [1] 0.0545
Let’s do one last simulation, this time jumping right to the larger version. Here, we will generate data from the zero-inflated Poisson distribution (or ZIP): it’s like a Poisson, but some percentage of the time (we will use 10%) we artificially convert the value to zero. Let’s see if we can detect this in the simulation:
N <- 10000
n <- 200
G <- rep(NA, N)
for (j in seq_along(G))
{
  y <- rpois(n, lambda = 2)
  y[runif(length(y)) < 0.1] <- 0                   # zero-inflation: convert about 10% of values to zero
  lhat <- mean(y)
  ptilde <- dpois(seq(0, 4), lambda = lhat)
  ptilde[length(ptilde)] <- 1 - sum(ptilde[-length(ptilde)])
  e <- ptilde * n
  x <- as.numeric(table(y))
  x[5] <- sum(x[seq(5, length(x))])
  x <- x[seq(1, 5)]
  G[j] <- -2 * sum( x * log(e / x) )
}
How often do we (correctly, in this case) reject the null hypothesis that this is a proper Poisson distribution:
## [1] 0.5645
A little more than half of the time. Not nearly as powerful as in the die-roll case, but the ZIP with 10% of the values converted to zero is also not all that different from a true Poisson. If we increased the percentage or increased lambda, the difference would be easier to detect.