Start by reading in all of the R packages that we will need for today:

library(dplyr)
library(ggplot2)
library(readr)
library(readxl)
library(tmodels)

## Class data collection

We are going to build a dataset as a class. Download the data file and save it as “class08.csv”. Read the dataset into R with the read_csv function and call it simply class:

class <- read_csv("class08.csv")

Run all three correlation tests (Pearson, Spearman, and Kendall) on the data:
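If the tmodels package is not installed, base R's cor.test function runs the same three tests through its method argument. A minimal sketch on the built-in mtcars data (substitute your own variables from the class dataset):

```r
# Base-R sketch of the three correlation tests, shown on the built-in
# mtcars data; swap in your own variables from the class dataset.
# exact = FALSE avoids warnings about tied values in the rank-based tests.
p <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
s <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)
k <- cor.test(mtcars$mpg, mtcars$wt, method = "kendall",  exact = FALSE)
c(pearson = p$p.value, spearman = s$p.value, kendall = k$p.value)
```

All three methods test the same basic question, so comparing their p-values side by side is a quick check on how sensitive your conclusion is to the choice of test.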

Are there any large differences between the p-values in the tests? What do you conclude from the analysis about the relationship between the variables?

## Graduate pay information by major

Now, read in the pay by major dataset that we used last time:

pay <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/grad_info.csv")

Run Pearson’s product-moment correlation test on this dataset with median_pay as the response and unemployment as the independent variable. Note the correlation point estimate and the p-value:

tmod_pearson_correlation_test(median_pay ~ unemployment, data = pay)
##
## Pearson's product-moment correlation test
##
##  H0: True correlation is zero
##  HA: True correlation is non-zero
##
##  Test statistic: t(171) = -1.4317
##  P-value: 0.1541
##
##  Parameter: (Pearson) correlation coefficient
##  Point estimate: -0.10883
##  Confidence interval: [-0.253910,  0.041033]

Re-run the analysis with median_pay as the independent variable and unemployment as the response.

tmod_pearson_correlation_test(unemployment ~ median_pay, data = pay)
##
## Pearson's product-moment correlation test
##
##  H0: True correlation is zero
##  HA: True correlation is non-zero
##
##  Test statistic: t(171) = -1.4317
##  P-value: 0.1541
##
##  Parameter: (Pearson) correlation coefficient
##  Point estimate: -0.10883
##  Confidence interval: [-0.253910,  0.041033]

The results are exactly the same: the correlation coefficient is symmetric in its two variables, so swapping the response and explanatory variable does not change the test.
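You can confirm this symmetry directly with base R's cor.test, shown here on the built-in mtcars data:

```r
# The correlation coefficient is symmetric in its arguments, so the
# point estimate and p-value are the same whichever variable comes first.
a <- cor.test(mtcars$mpg, mtcars$hp)
b <- cor.test(mtcars$hp, mtcars$mpg)
isTRUE(all.equal(unname(a$estimate), unname(b$estimate)))
isTRUE(all.equal(a$p.value, b$p.value))
```

This is unlike regression, where exchanging the response and explanatory variable generally does change the fitted model.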

## Graduate pay by major in three categories

The following code creates a new variable in the dataset. Run it and look at the dataset to see what it does:

pay$mcategory <- "non-science"
pay$mcategory[pay$sciences == "yes"] <- "other science"
pay$mcategory[grep("ENGINEERING", pay$major)] <- "engineering"

The variable mcategory now takes three different values: non-science, other science, and engineering. Use the one-way ANOVA function to test for a relationship between median pay and the variable mcategory:

tmod_one_way_anova_test(median_pay ~ mcategory, data = pay)
##
## One-way Analysis of Variance (ANOVA)
##
##  H0: True means are the same in each group.
##  HA: True means are not the same in each group.
##
##  Test statistic: F(2, 170) = 72.95
##  P-value: < 2.2e-16

Now, run the same test with the Kruskal-Wallis rank sum test:

tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay)
##
## Kruskal-Wallis rank sum test
##
##  H0: Location parameters are the same in each group.
##  HA: Location parameters are not the same in each group.
##
##  Test statistic: chi-squared(2) = 59.129
##  P-value: 1.447e-13

Finally, make a copy of the dataset and introduce a bad data point as we did last time:

pay_bad <- pay
pay_bad$median_pay[173] <- 22000000

Verify that the one-way ANOVA function is sensitive to this outlier but the Kruskal-Wallis test is not:

tmod_one_way_anova_test(median_pay ~ mcategory, data = pay_bad)
##
## One-way Analysis of Variance (ANOVA)
##
##  H0: True means are the same in each group.
##  HA: True means are not the same in each group.
##
##  Test statistic: F(2, 170) = 0.19071
##  P-value: 0.8265
tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay_bad)
##
## Kruskal-Wallis rank sum test
##
##  H0: Location parameters are the same in each group.
##  HA: Location parameters are not the same in each group.
##
##  Test statistic: chi-squared(2) = 56.896
##  P-value: 4.418e-13