Start by reading in all of the R packages that we will need for today:

library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
library(readxl)

Class data collection

We are going to build a dataset as a class. Download the data file and save it as “class08.csv”. Read in the dataset with the read_csv function and call it simply class:

class <- read_csv("class08.csv")

Run all three correlation tests (Pearson, Spearman, and Kendall) on the data:
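As a self-contained sketch, the same three tests can be cross-checked with base R's cor.test(), which selects the test through its method argument. The toy data frame and the column names x and y below are placeholders; substitute the class dataset and the variables actually collected in class08.csv:

```r
# Toy data standing in for the class dataset -- replace with the
# variables collected in class08.csv
class <- data.frame(x = c(1, 3, 2, 5, 4, 6, 8, 7),
                    y = c(2, 4, 1, 6, 5, 8, 7, 9))

# The three classical correlation tests, via base R's cor.test()
cor.test(class$x, class$y, method = "pearson")
cor.test(class$x, class$y, method = "spearman")
cor.test(class$x, class$y, method = "kendall")
```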

Are there any large differences between the p-values in the tests? What do you conclude from the analysis about the relationship between the variables?

Graduate pay information by major

Now, read in the pay by major dataset that we used last time:

pay <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/grad_info.csv")

Run Pearson’s product-moment correlation test on this dataset with median_pay as the response and unemployment as the independent variable. Note the correlation point estimate and the p-value:

tmod_pearson_correlation_test(median_pay ~ unemployment, data = pay)
## 
## Pearson's product-moment correlation test
## 
##  H0: True correlation is zero
##  HA: True correlation is non-zero
## 
##  Test statistic: t(171) = -1.4317
##  P-value: 0.1541
## 
##  Parameter: (Pearson) correlation coefficient
##  Point estimate: -0.10883
##  Confidence interval: [-0.253910,  0.041033]

Re-run the analysis with median_pay as the independent variable and unemployment as the response.

tmod_pearson_correlation_test(unemployment ~ median_pay, data = pay)
## 
## Pearson's product-moment correlation test
## 
##  H0: True correlation is zero
##  HA: True correlation is non-zero
## 
##  Test statistic: t(171) = -1.4317
##  P-value: 0.1541
## 
##  Parameter: (Pearson) correlation coefficient
##  Point estimate: -0.10883
##  Confidence interval: [-0.253910,  0.041033]

The results are identical: correlation is symmetric, so swapping the response and the independent variable leaves the test statistic, p-value, and point estimate unchanged.
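This symmetry can be verified directly in base R. A minimal sketch on toy data (the lab itself uses the pay dataset):

```r
# cor.test() is symmetric in its two arguments: swapping them leaves
# the estimate, the test statistic, and the p-value unchanged
x <- c(5.1, 3.2, 4.8, 6.0, 2.9, 5.5)
y <- c(2.0, 3.1, 2.5, 1.8, 3.6, 2.2)

t_xy <- cor.test(x, y)
t_yx <- cor.test(y, x)

all.equal(t_xy$estimate, t_yx$estimate)  # TRUE
all.equal(t_xy$p.value, t_yx$p.value)    # TRUE
```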

Graduate pay by major in three categories

The following code creates a new variable in the dataset. Run it and look at the dataset to see what it does:

pay$mcategory <- "non-science"
pay$mcategory[pay$sciences == "yes"] <- "other science"
pay$mcategory[grep("ENGINEERING", pay$major)] <- "engineering"
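The three assignments above can also be written as a single nested ifelse() in base R. The sketch below uses a toy data frame in place of the pay dataset; note that the ENGINEERING test must come first so it takes precedence over the sciences == "yes" branch, mirroring the overwrite order of the original code:

```r
# Toy data standing in for the pay dataset
toy <- data.frame(
  major    = c("CHEMICAL ENGINEERING", "BIOLOGY", "ENGLISH"),
  sciences = c("yes", "yes", "no")
)

# One-pass version of the three assignments: the ENGINEERING match
# is checked first so it overrides the sciences == "yes" branch
toy$mcategory <- ifelse(grepl("ENGINEERING", toy$major), "engineering",
                 ifelse(toy$sciences == "yes", "other science",
                        "non-science"))
toy$mcategory  # "engineering" "other science" "non-science"
```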

The variable mcategory now has three different values: non-science, other science, and engineering. Use the one-way ANOVA function to test for a relationship between median_pay and mcategory:

tmod_one_way_anova_test(median_pay ~ mcategory, data = pay)
## 
## One-way Analysis of Variance (ANOVA)
## 
##  H0: True means are the same in each group.
##  HA: True means are not the same in each group.
## 
##  Test statistic: F(2, 170) = 72.95
##  P-value: < 2.2e-16

Now, run the same analysis using the Kruskal-Wallis rank sum test:

tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay)
## 
## Kruskal-Wallis rank sum test
## 
##  H0: Location parameters are the same in each group.
##  HA: Location parameters are not the same in each group.
## 
##  Test statistic: chi-squared(2) = 59.129
##  P-value: 1.447e-13
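Both tests have base-R counterparts (aov and kruskal.test) that should agree with the tmodels output. A minimal self-contained sketch on toy data, standing in for the pay dataset:

```r
# Toy data: two groups with clearly different centers
toy <- data.frame(
  y = c(1.2, 2.3, 1.9, 5.1, 4.8, 5.5),
  g = c("a", "a", "a", "b", "b", "b")
)

summary(aov(y ~ g, data = toy))  # one-way ANOVA table
kruskal.test(y ~ g, data = toy)  # Kruskal-Wallis rank sum test
```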

Finally, make a copy of the dataset and introduce a bad data point as we did last time:

pay_bad <- pay
pay_bad$median_pay[173] <- 22000000

Verify that the one-way ANOVA function is sensitive to this outlier but the Kruskal-Wallis test is not:

tmod_one_way_anova_test(median_pay ~ mcategory, data = pay_bad)
## 
## One-way Analysis of Variance (ANOVA)
## 
##  H0: True means are the same in each group.
##  HA: True means are not the same in each group.
## 
##  Test statistic: F(2, 170) = 0.19071
##  P-value: 0.8265
tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay_bad)
## 
## Kruskal-Wallis rank sum test
## 
##  H0: Location parameters are the same in each group.
##  HA: Location parameters are not the same in each group.
## 
##  Test statistic: chi-squared(2) = 56.896
##  P-value: 4.418e-13
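The contrast comes down to ranks: the Kruskal-Wallis statistic depends on the data only through their ranks, so an enormous value at most moves one observation to the top rank (here nudging the chi-squared statistic from 59.13 to 56.90), while the group means that drive the F statistic are dragged far away. A small base-R illustration, where the outlier replaces an already-largest value so the ranks do not move at all:

```r
x     <- c(31000, 45000, 52000, 58000)
x_bad <- c(31000, 45000, 52000, 22000000)  # outlier replaces the largest value

rank(x)      # 1 2 3 4
rank(x_bad)  # 1 2 3 4 -- ranks unchanged, so rank-based tests barely move
mean(x)      # 46500
mean(x_bad)  # 5532000 -- the mean, and hence the F statistic, is wrecked
```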