Start by loading all of the R packages that we will need for today:

```
library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
library(readxl)
```

Next, read in the following dataset, which describes the salaries of recent graduates based on their major:

`pay <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/grad_info.csv")`

Open the dataset with the data viewer and make sure that you understand what the variables all mean (if something is unclear, just ask!). The unit of observation here is a major.

Run a T-test that predicts the median pay based on whether a major is in the sciences:

`tmod_t_test(median_pay ~ sciences, data = pay)`

```
##
## Two Sample t-test
##
## H0: Difference in true means is zero
## HA: Difference in true means is non-zero
##
## Test statistic: t(63.226) = -6.0656
## P-value: 8.118e-08
##
## Parameter: Difference in means (no - yes)
## Point estimate: -12645
## Confidence interval: [-16811.0, -8479.4]
```

Is the p-value significant? Which group has the higher median pay? How would you describe the results of this study?
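For comparison, here is a minimal sketch of a similar two-sample test using only base R's `t.test`. The fractional degrees of freedom in the output above suggest an unequal-variance (Welch) test, which is also the default for `t.test`; whether `tmod_t_test` calls `t.test` internally is an assumption. The `toy` data frame is made up for illustration and does not contain values from `grad_info.csv`.

```r
# Toy data (hypothetical values, not taken from grad_info.csv)
toy <- data.frame(
  sciences   = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  median_pay = c(30000, 32000, 35000, 33000, 45000, 50000, 48000, 52000)
)

# With a formula, t.test() runs Welch's unequal-variance test by default
fit <- t.test(median_pay ~ sciences, data = toy)

# Difference in group means, ordered "no" minus "yes" (groups sort alphabetically)
diff_means <- unname(fit$estimate[1] - fit$estimate[2])

print(diff_means)    # point estimate
print(fit$p.value)   # two-sided p-value
print(fit$conf.int)  # 95% confidence interval for the difference
```

A negative difference here means the "yes" group has the larger mean, which is the same sign convention as the "(no - yes)" line in the output above.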

Run a Mann-Whitney test for median pay; how does the p-value compare to that of the T-test?

`tmod_mann_whitney_test(median_pay ~ sciences, data = pay)`

```
##
## Mann-Whitney rank sum test with continuity correction
##
## H0: True location shift is equal to zero
## HA: True location shift is not equal to zero
##
## Test statistic: W = 1217.5
## P-value: 9.696e-11
```
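The same kind of test is available in base R as `wilcox.test`. The sketch below uses a small made-up data frame (the `toy` values are hypothetical) to show where the `W` statistic comes from. Because this toy sample is tiny and has no ties, base R computes an exact p-value rather than the continuity-corrected approximation reported above; whether `tmod_mann_whitney_test` wraps `wilcox.test` is an assumption.

```r
# Toy data (hypothetical values, not taken from grad_info.csv)
toy <- data.frame(
  sciences   = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  median_pay = c(30000, 32000, 35000, 33000, 45000, 50000, 48000, 52000)
)

# Mann-Whitney (Wilcoxon rank sum) test comparing the two groups
fit_w <- wilcox.test(median_pay ~ sciences, data = toy)

# W counts the (no, yes) pairs in which the "no" value is the larger one
w_stat <- unname(fit_w$statistic)

print(w_stat)
print(fit_w$p.value)
```

In this toy sample every "no" value is below every "yes" value, so no pair counts toward `W`.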

Now, run a T-test that predicts the unemployment rate based on whether a major is in the sciences:

`tmod_t_test(unemployment ~ sciences, data = pay)`

```
##
## Two Sample t-test
##
## H0: Difference in true means is zero
## HA: Difference in true means is non-zero
##
## Test statistic: t(98.573) = 2.5265
## P-value: 0.01311
##
## Parameter: Difference in means (no - yes)
## Point estimate: 0.012498
## Confidence interval: [0.0026821, 0.0223140]
```

Is the p-value significant? Which group has a lower unemployment rate? How would you describe the results of this study?

The `tail` function shows the last few rows of your dataset:

`tail(pay)`

```
## # A tibble: 6 x 4
## major sciences unemployment median_pay
## <chr> <chr> <dbl> <dbl>
## 1 COMPOSITION AND RHETORIC no 0.0817 27000
## 2 ZOOLOGY yes 0.0463 26000
## 3 EDUCATIONAL PSYCHOLOGY no 0.0651 25000
## 4 CLINICAL PSYCHOLOGY no 0.149 25000
## 5 COUNSELING PSYCHOLOGY no 0.0536 23400
## 6 LIBRARY SCIENCE no 0.105 22000
```

Library science majors do not make very much money. What if someone accidentally wrote down the pay of 22k per year as 22 million per year? We can change this one value with the following R code:

`pay$median_pay[173] <- 22000000`
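Indexing by a hard-coded row number works only as long as the row order never changes. As a hedged alternative, the sketch below looks the row up by the major's name. It uses a three-row stand-in called `toy` (a hypothetical object built from rows shown in the `tail` output above) so that it runs without downloading the data.

```r
# Three-row stand-in for `pay`, using rows shown in the tail() output
toy <- data.frame(
  major      = c("ZOOLOGY", "CLINICAL PSYCHOLOGY", "LIBRARY SCIENCE"),
  median_pay = c(26000, 25000, 22000),
  stringsAsFactors = FALSE
)

# Find the row by name rather than by position
row_id <- which(toy$major == "LIBRARY SCIENCE")

# Guard against zero or multiple matches before overwriting anything
stopifnot(length(row_id) == 1)

# Introduce the deliberate data-entry error
toy$median_pay[row_id] <- 22000000

print(toy)
```

The same pattern should apply to the full dataset by replacing `toy` with `pay`, assuming the major names are unique.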

Now re-run the T-test that predicts the median pay based on whether a major is in the sciences:

`tmod_t_test(median_pay ~ sciences, data = pay)`

```
##
## Two Sample t-test
##
## H0: Difference in true means is zero
## HA: Difference in true means is non-zero
##
## Test statistic: t(119.03) = 0.93151
## P-value: 0.3535
##
## Parameter: Difference in means (no - yes)
## Point estimate: 170500
## Confidence interval: [-191940, 532950]
```

Is the test still significant? Now, re-run the Mann-Whitney test:

`tmod_mann_whitney_test(median_pay ~ sciences, data = pay)`

```
##
## Mann-Whitney rank sum test with continuity correction
##
## H0: True location shift is equal to zero
## HA: True location shift is not equal to zero
##
## Test statistic: W = 1270.5
## P-value: 3.039e-10
```

Is this test still significant? How much does the test statistic change compared to the original dataset? Which of the two tests is more robust to one bad data point?
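One way to think about these questions is to compare what a single bad value does to a mean versus what it does to ranks, since the T-test is built on means and the Mann-Whitney test is built on ranks. The following sketch uses five made-up pay values (the vectors `pay_vals` and `pay_bad` are hypothetical, not the course data):

```r
# Five hypothetical pay values, lowest first
pay_vals <- c(22000, 25000, 26000, 27000, 40000)

# Copy the vector and introduce the same kind of data-entry error
pay_bad <- pay_vals
pay_bad[1] <- 22000000

# The mean is pulled far away by the single bad value
mean_before <- mean(pay_vals)
mean_after  <- mean(pay_bad)

# The ranks only record order: the bad value moves from lowest to highest,
# no matter how extreme it is
rank_before <- rank(pay_vals)
rank_after  <- rank(pay_bad)

print(c(mean_before, mean_after))
print(rank_before)
print(rank_after)
```

Making the bad value 22 billion instead of 22 million would change the mean again but leave the ranks exactly as they are.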