Start by reading in all of the R packages that we will need for today:

```
library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
library(readxl)
```

We are going to build a dataset as a class. Download the data file and save it as “class08.csv”. Read in the dataset with the `read_csv` function and call it simply `class`:

`class <- read_csv("class08.csv")`

Run all three correlation tests (Pearson, Spearman, and Kendall) on the data.

Are there any large differences between the p-values in the tests? What do you conclude from the analysis about the relationship between the variables?
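If it helps to see the three tests side by side, base R's `cor.test` can run all of them through its `method` argument. Below is a self-contained sketch on synthetic data; substitute two numeric columns from your `class` dataset for `x` and `y`:

```
# Synthetic data standing in for two columns of the class dataset
set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)

# The three classical correlation tests, selected by the method argument
cor.test(x, y, method = "pearson")   # linear correlation
cor.test(x, y, method = "spearman")  # rank correlation
cor.test(x, y, method = "kendall")   # concordance-based rank correlation
```

Each call reports a point estimate, a test statistic, and a p-value, so the three p-values can be compared directly.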

Now, read in the pay by major dataset that we used last time:

`pay <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/grad_info.csv")`

Run Pearson’s product-moment correlation test on this dataset with `median_pay` as the response and `unemployment` as the independent variable. Note the correlation point estimate and the p-value:

`tmod_pearson_correlation_test(median_pay ~ unemployment, data = pay)`

```
##
## Pearson's product-moment correlation test
##
## H0: True correlation is zero
## HA: True correlation is non-zero
##
## Test statistic: t(171) = -1.4317
## P-value: 0.1541
##
## Parameter: (Pearson) correlation coefficient
## Point estimate: -0.10883
## Confidence interval: [-0.253910, 0.041033]
```

Re-run the analysis with `median_pay` as the independent variable and `unemployment` as the response:

`tmod_pearson_correlation_test(unemployment ~ median_pay, data = pay)`

```
##
## Pearson's product-moment correlation test
##
## H0: True correlation is zero
## HA: True correlation is non-zero
##
## Test statistic: t(171) = -1.4317
## P-value: 0.1541
##
## Parameter: (Pearson) correlation coefficient
## Point estimate: -0.10883
## Confidence interval: [-0.253910, 0.041033]
```

The results are exactly the same: the correlation test is symmetric in its two variables, so swapping the response and the independent variable changes nothing.
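This symmetry is easy to verify directly with base R's `cor.test` on made-up data; swapping the two arguments yields an identical estimate and p-value:

```
# Two synthetic variables with a moderate linear relationship
set.seed(42)
a <- rnorm(30)
b <- 0.5 * a + rnorm(30)

# Swapping the roles of the two variables changes nothing
t1 <- cor.test(a, b)
t2 <- cor.test(b, a)
c(t1$estimate == t2$estimate, t1$p.value == t2$p.value)
```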

The following code creates a new variable in the dataset. Run it and look at the dataset to see what it does:

```
pay$mcategory <- "non-science"
pay$mcategory[pay$sciences == "yes"] <- "other science"
pay$mcategory[grep("ENGINEERING", pay$major)] <- "engineering"
```
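To see the recoding logic in isolation, here is the same three-step assignment applied to a tiny made-up data frame (the majors below are hypothetical stand-ins, not actual rows of `pay`):

```
# A small toy analogue of the pay dataset
toy <- data.frame(
  major    = c("CHEMICAL ENGINEERING", "BIOLOGY", "ENGLISH", "HISTORY"),
  sciences = c("yes", "yes", "no", "no")
)

# Default everything to non-science, then overwrite the science majors,
# then overwrite any major whose name contains "ENGINEERING"
toy$mcategory <- "non-science"
toy$mcategory[toy$sciences == "yes"] <- "other science"
toy$mcategory[grep("ENGINEERING", toy$major)] <- "engineering"

table(toy$mcategory)
```

The order of the three assignments matters: engineering is applied last, so an engineering major tagged as a science still ends up in the engineering group.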

The variable `mcategory` now has three different values: non-science, other science, and engineering. Use the one-way ANOVA function to test for a relationship between median pay and `mcategory`:

`tmod_one_way_anova_test(median_pay ~ mcategory, data = pay)`

```
##
## One-way Analysis of Variance (ANOVA)
##
## H0: True means are the same in each group.
## HA: True means are not the same in each group.
##
## Test statistic: F(2, 170) = 72.95
## P-value: < 2.2e-16
```

Now, run the same test with the Kruskal-Wallis rank sum test:

`tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay)`

```
##
## Kruskal-Wallis rank sum test
##
## H0: Location parameters are the same in each group.
## HA: Location parameters are not the same in each group.
##
## Test statistic: chi-squared(2) = 59.129
## P-value: 1.447e-13
```

Finally, make a copy of the dataset and introduce a bad data point as we did last time:

```
pay_bad <- pay
pay_bad$median_pay[173] <- 22000000
```
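Why should one huge value break the ANOVA but not the rank test? ANOVA compares group means, and a single extreme point can drag a mean arbitrarily far, whereas ranks only record ordering, which caps the outlier's influence. A minimal base-R illustration with made-up pay values:

```
# Four typical pay values plus one wild outlier
pays <- c(30000, 35000, 40000, 45000, 22000000)

mean(pays)    # dominated by the outlier
median(pays)  # still a typical value

# Ranks only record that the outlier is largest, not how large it is
rank(pays)
```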

Verify that the one-way ANOVA function is sensitive to this outlier but the Kruskal-Wallis test is not:

`tmod_one_way_anova_test(median_pay ~ mcategory, data = pay_bad)`

```
##
## One-way Analysis of Variance (ANOVA)
##
## H0: True means are the same in each group.
## HA: True means are not the same in each group.
##
## Test statistic: F(2, 170) = 0.19071
## P-value: 0.8265
```

`tmod_kruskal_wallis_test(median_pay ~ mcategory, data = pay_bad)`

```
##
## Kruskal-Wallis rank sum test
##
## H0: Location parameters are the same in each group.
## HA: Location parameters are not the same in each group.
##
## Test statistic: chi-squared(2) = 56.896
## P-value: 4.418e-13
```