Start by reading in all of the R packages that we will need for today:

library(dplyr)
library(ggplot2)
library(readr)
library(tmodels)
library(readxl)

## Graduate pay information by major

Start by reading in the following dataset, which describes the salaries of recent graduates based on their major:

pay <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/grad_info.csv")

Open the dataset with the data viewer and make sure that you understand what the variables all mean (if something is unclear, just ask!). The unit of observation here is a major.

Run a T-test that predicts the median pay based on whether a major is in the sciences:

tmod_t_test(median_pay ~ sciences, data = pay)
##
## Two Sample t-test
##
##  H0: Difference in true means is zero
##  HA: Difference in true means is non-zero
##
##  Test statistic: t(63.226) = -6.0656
##  P-value: 8.118e-08
##
##  Parameter: Difference in means (no - yes)
##  Point estimate: -12645
##  Confidence interval: [-16811.0,  -8479.4]

Is the p-value significant? Which group has the higher median pay? How would you describe the results of this study?

Run a Mann-Whitney test for median pay; how does the p-value compare to that of the T-test?

tmod_mann_whitney_test(median_pay ~ sciences, data = pay)
##
## Mann-Whitney rank sum test with continuity correction
##
##  H0: True location shift is equal to zero
##  HA: True location shift is not equal to zero
##
##  Test statistic: W = 1217.5
##  P-value: 9.696e-11

Now, run a T-test that predicts the unemployment rate based on whether a major is in the sciences:

tmod_t_test(unemployment ~ sciences, data = pay)
##
## Two Sample t-test
##
##  H0: Difference in true means is zero
##  HA: Difference in true means is non-zero
##
##  Test statistic: t(98.573) = 2.5265
##  P-value: 0.01311
##
##  Parameter: Difference in means (no - yes)
##  Point estimate: 0.012498
##  Confidence interval: [0.0026821, 0.0223140]

Is the p-value significant? Which group has a lower unemployment rate? How would you describe the results of this study?
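One way to check the direction of a difference reported as "(no - yes)" is to compute the two group means directly. The sketch below is only an illustration: it uses base R's `tapply` on a toy data frame (the name `toy` is made up here) built from the six rows that `tail(pay)` prints in the Robustness section, not the full dataset, so its numbers will not match the test output above.

```r
# Toy data: only the six rows printed by tail(pay), not the full dataset
toy <- data.frame(
  major = c("COMPOSITION AND RHETORIC", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
            "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"),
  sciences = c("no", "yes", "no", "no", "no", "no"),
  unemployment = c(0.0817, 0.0463, 0.0651, 0.149, 0.0536, 0.105),
  median_pay = c(27000, 26000, 25000, 25000, 23400, 22000)
)

# Mean unemployment rate within each level of `sciences`
group_means <- tapply(toy$unemployment, toy$sciences, mean)

# Same orientation as the test output: non-science mean minus science mean
diff_no_minus_yes <- group_means[["no"]] - group_means[["yes"]]

print(group_means)
print(diff_no_minus_yes)
```

A positive difference means the "no" group has the higher mean, so the "yes" group has the lower unemployment rate. Replacing `toy` with `pay` should give the group means for the full dataset.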

## Robustness

The tail function shows the last few rows of your dataset:

tail(pay)
## # A tibble: 6 x 4
##   major                    sciences unemployment median_pay
##   <chr>                    <chr>           <dbl>      <dbl>
## 1 COMPOSITION AND RHETORIC no             0.0817      27000
## 2 ZOOLOGY                  yes            0.0463      26000
## 3 EDUCATIONAL PSYCHOLOGY   no             0.0651      25000
## 4 CLINICAL PSYCHOLOGY      no             0.149       25000
## 5 COUNSELING PSYCHOLOGY    no             0.0536      23400
## 6 LIBRARY SCIENCE          no             0.105       22000

Library science majors do not make very much money. What if someone accidentally wrote down the pay of 22k per year as 22 million per year? We can change this one value with the following R code:

pay$median_pay[173] <- 22000000
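Indexing by row number works only if Library Science really sits in row 173. A slightly safer variant, sketched here on a toy data frame (the names `toy` and `idx` are made up for this illustration), looks the row up by the major's name and checks that exactly one row matches before overwriting it.

```r
# Toy data: majors and median pay copied from the tail(pay) output above
toy <- data.frame(
  major = c("COMPOSITION AND RHETORIC", "ZOOLOGY", "EDUCATIONAL PSYCHOLOGY",
            "CLINICAL PSYCHOLOGY", "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"),
  median_pay = c(27000, 26000, 25000, 25000, 23400, 22000)
)

# Find the row by name instead of hard-coding its position
idx <- which(toy$major == "LIBRARY SCIENCE")

# Guard against zero or multiple matches before changing anything
stopifnot(length(idx) == 1)

# Introduce the deliberate data-entry error
toy$median_pay[idx] <- 22000000

print(toy)
```

The same pattern should work on `pay` itself, assuming the major is spelled exactly as it appears in the printed table.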

Now re-run the T-test that predicts the median pay based on whether a major is in the sciences:

tmod_t_test(median_pay ~ sciences, data = pay)
##
## Two Sample t-test
##
##  H0: Difference in true means is zero
##  HA: Difference in true means is non-zero
##
##  Test statistic: t(119.03) = 0.93151
##  P-value: 0.3535
##
##  Parameter: Difference in means (no - yes)
##  Point estimate: 170500
##  Confidence interval: [-191940,  532950]

Is the test still significant? Now, re-run the Mann-Whitney test:

tmod_mann_whitney_test(median_pay ~ sciences, data = pay)
##
## Mann-Whitney rank sum test with continuity correction
##
##  H0: True location shift is equal to zero
##  HA: True location shift is not equal to zero
##
##  Test statistic: W = 1270.5
##  P-value: 3.039e-10

Is this test still significant? How much does the test statistic change compared to the original dataset? Which of the two tests is more robust to one bad data point?
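The same comparison can be tried on simulated data, where the true difference between the groups is known. The sketch below is an illustration under stated assumptions, not the lab's own method: it uses base R's `t.test` and `wilcox.test` rather than the `tmodels` functions, and the group sizes, means, and standard deviations are invented for the example.

```r
set.seed(1)

# Simulated pay for two groups with a real difference in means (made-up values)
no_grp  <- rnorm(60, mean = 35000, sd = 5000)
yes_grp <- rnorm(60, mean = 45000, sd = 5000)

# P-values on the clean data
p_t_clean <- t.test(no_grp, yes_grp)$p.value
p_w_clean <- wilcox.test(no_grp, yes_grp)$p.value

# Corrupt one value: the lowest "no" observation becomes 22 million
no_bad <- no_grp
no_bad[which.min(no_bad)] <- 22000000

# P-values after the single bad data point
p_t_bad <- t.test(no_bad, yes_grp)$p.value
p_w_bad <- wilcox.test(no_bad, yes_grp)$p.value

print(c(t_clean = p_t_clean, t_bad = p_t_bad))
print(c(w_clean = p_w_clean, w_bad = p_w_bad))
```

The t-test depends on the group means and variances, both of which a single extreme value can dominate. The rank-based test only sees that one observation moved from the lowest rank to the highest, so its statistic can shift by at most the size of the other group.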