Class 08: Inference for Correlation; Multiple Means

Learning Objectives

• Understand the correlation between two variables.
• Know how to use the three hypothesis tests for testing correlation between two variables.
• Use the One-way ANOVA and Kruskal-Wallis rank sum test to extend the T-Test to 3 or more groups.

Correlation

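You have probably heard the term correlation, but may never have seen a formal definition. If we have n pairs of observations (x_i, y_i) from two variables x and y, the sample correlation between them is given by the following seemingly complex formula:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Here \bar{x} and \bar{y} are the sample means of x and y. The formula itself matters less than a few properties of the correlation: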

1. The correlation is always between -1 and 1
2. The correlation of two unrelated (independent) variables is 0; in a sample it will be close to, but rarely exactly, 0.
3. The correlation is positive if y “tends to be larger than its mean” whenever x is larger than its mean.
4. The correlation is negative if y “tends to be less than its mean” whenever x is larger than its mean.
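To make these properties concrete, here is a minimal sketch that computes the sample correlation by hand. It is written in plain Python and does not use tmodels; the function name `pearson_r` and the toy numbers are invented for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the two means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]    # tends to be above its mean when x is above its mean
y_down = [6, 4, 5, 4, 2]  # tends to be below its mean when x is above its mean

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 3), round(r_down, 3))
```

The first value comes out positive and the second negative, and both stay between -1 and 1, as properties 1, 3 and 4 require.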

Here are some examples of correlations for various pairs of variables plotted along the x- and y-axes:

Testing correlation

Today, let’s take a look at a different version of the same dataset from last class. Now we have two continuous variables: the city fuel efficiency and the highway fuel efficiency.

To run a hypothesis test on the correlation between these two variables, we will use a tmodels function with a similar format to the other tests that we have used. The test is called “Pearson’s product-moment correlation”:

The p-value is very low (less than 0.000000000000022%) and the sample correlation of 0.95592 is quite close to 1. We should not be surprised that these two variables are highly related. Notice that in just this one case, it doesn’t matter which variable is treated as the response and which one is the effect (the output is exactly the same):
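The symmetry can be checked with a small sketch of the statistic behind Pearson's test, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. This is plain Python rather than tmodels; the function names are hypothetical and the numbers are invented toy values, not the class dataset. The p-value step (a lookup in the t distribution) is left out.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_t_statistic(x, y):
    """Return (r, t, df) for the test of H0: true correlation is 0."""
    r = pearson_r(x, y)
    df = len(x) - 2
    # Undefined when r is exactly 1 or -1 (division by zero).
    t = r * sqrt(df / (1 - r ** 2))
    return r, t, df

# Toy values, invented for illustration (not the cars data).
city = [18, 21, 20, 16, 24, 19]
hwy = [29, 30, 31, 26, 35, 27]

r_xy, t_xy, df = correlation_t_statistic(city, hwy)
r_yx, t_yx, _ = correlation_t_statistic(hwy, city)
print(r_xy == r_yx, abs(t_xy - t_yx) < 1e-9, df)
```

Swapping the two variables only swaps the roles of x and y inside products, so r and t cannot change.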

There are two alternatives to Pearson’s correlation test available in tmodels. Both are similar to the Mann-Whitney test in that they are more robust to outliers and bad data points, but they are also more likely to fail to reject the null hypothesis when it is false. The first is “Kendall’s rank correlation tau test”:

And the other is “Spearman’s rank correlation rho test”:

I will be completely honest: I do not have a good sense of when you should use Kendall’s test rather than Spearman’s. Just be familiar with both, as they are commonly used across the sciences and social sciences.
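One way to see how the two rank-based measures relate is to compute them by hand. In the sketch below, Spearman's rho is Pearson's correlation applied to the ranks, and Kendall's tau counts pairs of observations that are ordered the same way (concordant) against pairs ordered the opposite way (discordant). This is plain Python, not tmodels; the function names are hypothetical, the data are invented, and the simple formulas assume there are no tied values.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau (no ties): (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y_outlier = [2, 3, 5, 7, 8, 100]  # increasing, but with one extreme value
y_swapped = [1, 3, 2, 5, 4, 6]    # two neighbouring pairs out of order

print(pearson_r(x, y_outlier), spearman_rho(x, y_outlier), kendall_tau(x, y_outlier))
print(spearman_rho(x, y_swapped), kendall_tau(x, y_swapped))
```

With the outlier, Pearson's correlation drops well below 1 while both rank-based measures still report a perfect increasing relationship. On the second example the two rank measures give different numbers because they summarise disagreement in the ordering differently.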

Testing multiple means

As a totally different direction, let’s return briefly to testing the means of a random variable by groups. Take another look at a version of the cars dataset; now it includes information about several different manufacturers:

The tmod_mean_by_group function shows us the means of the fuel efficiency for each manufacturer:

What if we want to run a hypothesis test with:

• H0: The means are the same for each group.
• HA: At least one group has a mean that differs from the others.

Running the T-test will not work (if you go back to the last notes, you’ll see that it cannot be extended easily to more than two groups):

The “One-way Analysis of Variance (ANOVA)” test is the multiple group extension of the T-Test:
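As a rough sketch of what the ANOVA test computes, the F statistic compares how far the group means sit from the overall mean (variation between groups) with how spread out the observations are around their own group mean (variation within groups). The code below is plain Python rather than tmodels; the function name and the numbers are invented for illustration, and the p-value step (a lookup in the F distribution) is omitted.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = k - 1
    df_within = n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Toy data for three hypothetical groups, invented for illustration.
groups = [[20, 22, 21], [25, 27, 26], [30, 29, 31]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

A large F means the group means differ by more than the within-group noise would explain. With exactly two groups, F equals the square of the pooled (equal-variance) two-sample t statistic, which is the sense in which ANOVA extends the T-Test.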

The Mann-Whitney test also has a multi-group extension called the “Kruskal-Wallis rank sum test”:
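The idea behind Kruskal-Wallis can be sketched in a few lines: pool all observations, replace them by their ranks, and measure how unevenly the ranks are spread across the groups with H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), where R_j is the rank sum of group j. This is plain Python, not tmodels; the function name and data are invented for illustration, and the formula as written assumes no tied values.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups; assumes no ties."""
    pooled = sorted(v for g in groups for v in g)
    # Rank 1 is the smallest pooled value, rank N the largest.
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Toy data for three hypothetical groups, invented for illustration.
groups = [[20, 22, 21], [25, 27, 26], [30, 29, 31]]
with_outlier = [[20, 22, 21], [25, 27, 26], [30, 29, 3100]]

h = kruskal_wallis_h(groups)
h_outlier = kruskal_wallis_h(with_outlier)
print(h, h_outlier)
```

Because only the ranks enter the calculation, replacing the largest value by an extreme one leaves H unchanged, which is the robustness to outliers that the rank-based tests share.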

Notice that these tests do not give any point estimates; they just tell us whether there seem to be any differences in means across the groups.