I have produced some fixes to the tmodels package and you need to re-install it for today’s lab to work. Do that by running the following before you do anything else today:

# devtools::install_github("statsmaths/tmodels")

Then, read in all of the R packages that we will need for today:

library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
library(readxl)

Diamonds Dataset

For the first part of this lab, I want you to read into R a dataset of diamond prices, along with various characteristics of the diamonds:

diamonds <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/diamonds_small.csv")

We are going to run a number of regression analyses to understand how various features effect the price of a diamond. Unless otherwise noted, all of the models use price as the response variable.

  1. Start by running a regression model that predicts price as a function of carat (the weight of the diamond). Take note of whether the result is significant or not.
tmod_linear_regression(price ~ carat, data = diamonds)
## 
## Linear regression; T-Test
## 
##  H0: Change in conditional mean is zero
##  HA: Change in conditional mean is non-zero
## 
##  Test statistic: t = 70.966
##  P-value: < 2.2e-16
## 
##  Parameter: change in price for unit change in carat
##              controlling for -- 
##  Point estimate: 7644.2
##  Confidence interval: [7432.9, 7855.6]

Yes, the result is significant (p-value is < 2.2e-16).

  1. Repeat the same analysis, but this time use the Pearson Correlation Test. Verify that the T-statistic is the same value as for linear regression.
tmod_pearson_correlation_test(price ~ carat, data = diamonds)
## 
## Pearson's product-moment correlation test
## 
##  H0: True correlation is zero
##  HA: True correlation is non-zero
## 
##  Test statistic: t(998) = 70.966
##  P-value: < 2.2e-16
## 
##  Parameter: (Pearson) correlation coefficient
##  Point estimate: 0.91357
##  Confidence interval: [0.90270, 0.92327]

Are the point estimates and confidence intervals the same? Why or why not?

The p-value and T statistic are exactly the same, however the point estimates (7644.2 and 0.91357) are different. They measure different things, the slope of the regression line and the correlation, respectively.

  1. Now, build a linear regression that predicts price as a function of the “4 C’s”: carat, color, clarity and cut. Treat carat as the IV and the other three as nusiance variables (as a reminder, the order of the variable only matters in terms of which one goes first after the ~; otherwise there is no difference in the output).
tmod_linear_regression(price ~ carat + color + clarity + cut, data = diamonds)
## 
## Linear regression; T-Test
## 
##  H0: Change in conditional mean is zero
##  HA: Change in conditional mean is non-zero
## 
##  Test statistic: t = 97.074
##  P-value: < 2.2e-16
## 
##  Parameter: change in price for unit change in carat
##              controlling for -- color; clarity; cut
##  Point estimate: 8760.9
##  Confidence interval: [8583.8, 8938.0]

How does the estimate compare to the result without controlling for the other variables? Is the effect stronger or weaker? Does this make sense to you?

The effect size is larger (its now 8760.9 and was 7644.2). It means that a change in the size of the diamond is actually is more strongly related to price when controlling for the other variables. This is because, in general, larger diamond sizes are typically associated with worse cut/color/clarity.

  1. Now, run a regression model that predicts price as just a function of clarity.
tmod_linear_regression(price ~ clarity, data = diamonds)
## 
## Linear regression; F-Test
## 
##  H0: Difference in conditional mean is zero
##  HA: Difference in true means is non-zero
## 
##  Test statistic: F(1, 7) = 2.0553
##  P-value: 0.04578

Is the result significant at the 0.05 level? Notice how the output looks different than the first result because we now have a categorical variable as the input variable.

Yes, it is significant at the 0.05 level (p-value: 0.04578).

  1. Run a One-way ANOVA test predicting price as a function of clarity.
tmod_one_way_anova_test(price ~ clarity, data = diamonds)
## 
## One-way Analysis of Variance (ANOVA)
## 
##  H0: True means are the same in each group.
##  HA: True means are the same in each group.
## 
##  Test statistic: F(7, 992) = 2.0553
##  P-value: 0.04578

You should find that this has the exact same F-statistics and P-value (the numbers in the F-statistic are different, but lead to the same result).

Yes, they are exactly the same (p-value: 0.04578; statistic: 2.0553).

  1. Run a linear regression that predicts price as a function of clarity with cut as a nusiance variable.
tmod_linear_regression(price ~ clarity + cut, data = diamonds)
## 
## Linear regression; F-Test
## 
##  H0: Difference in conditional mean is zero
##  HA: Difference in true means is non-zero
## 
##  Test statistic: F(1, 7) = 1.4176
##  P-value: 0.1944

You should see that the test is now no longer significant. What does that mean in pratical terms?

It means that we do not have evidence that clarity effects the diamond price if we account for cut but not for the other variables.

  1. Run the price as a function of the 4 C’s again, but this time use clarity as the input variable.
tmod_linear_regression(price ~ clarity + carat + color + cut, data = diamonds)
## 
## Linear regression; F-Test
## 
##  H0: Difference in conditional mean is zero
##  HA: Difference in true means is non-zero
## 
##  Test statistic: F(1, 7) = 81.729
##  P-value: < 2.2e-16

You should see that this is once again significant. Try to summarize the results from questions 5-7 and take note of how tricky and difficult linear regression can be to use!

Now it is again significant. So, clarity effects the price on its own and when controlling for all of the nusiance variables, but not when we only acount for cut.

  1. There are no variables in the diamonds dataset that are categorical with only two categories, so we need to make one for you to see how this works with linear regression. Run the following to create a variable named color_good that breaks the colors into two distinct categories:
diamonds <- mutate(diamonds, color_good = if_else(color %in% c("D", "E"), "good", "poor"))

Run the code and see the new variable created in the data-viewer. Then, run a linear regression predicting price as a function of color_good:

tmod_linear_regression(price ~ color_good, data = diamonds)
## 
## Linear regression; T-Test
## 
##  H0: Difference in conditional mean is zero
##  HA: Difference in conditional mean is non-zero
## 
##  Test statistic: t = 3.0616
##  P-value: 0.00226
## 
##  Parameter: mean(price|poor) - mean(price|good)
##             after controlling for -- 
##  Point estimate: 747.29
##  Confidence interval: [ 268.31, 1226.30]

It is significant, but notice that it suggests that diamonds with poor color are more expensive than diamonds with good color.

  1. Run a T-sample t-test predicting price as a function of color_good:
tmod_t_test(price ~ color_good, data = diamonds)
## 
## Two Sample t-test
## 
##  H0: Difference in true means is zero
##  HA: Difference in true means is non-zero
## 
##  Test statistic: t(751.06) = -3.2428
##  P-value: 0.001236
## 
##  Parameter: Difference in means (good - poor)
##  Point estimate: -747.29
##  Confidence interval: [-1199.70,  -294.89]

How do the results compare to the regression analysis? Note that, for one thing, the defaults are switched so that the terms here are all negative. You should also see that the point estimate is the same, but the other values are slightly different.

What is going on? There is a slightly different assumption that the linear regression model uses compared to the two-sample T-test. The regression model assumes that the variation is the of price is the same in both groups and the two-sample T-test doesn’t. That is where the discrepency arises. This isn’t anything to worry about, but I just wanted you to have seen it.

  1. Finally, run a regression for price as a function of color_good
  1. and the nusiance variables carat, cut, and clarity:
tmod_linear_regression(price ~ color_good + carat + cut + clarity, data = diamonds)
## 
## Linear regression; T-Test
## 
##  H0: Difference in conditional mean is zero
##  HA: Difference in conditional mean is non-zero
## 
##  Test statistic: t = -7.4685
##  P-value: 1.783e-13
## 
##  Parameter: mean(price|poor) - mean(price|good)
##             after controlling for -- carat; cut; clarity
##  Point estimate: -609.67
##  Confidence interval: [-769.86, -449.47]

Notice that the output is different than in question 7 where there are multiple categories.

Regression in scientific literature

Open the following article from the British Medical Journal on “Geographical variation in glaucoma prescribing trends in England 2008–2012: an observationa ecological study”:

Read the Abstract and Methods sections. Also look at Tables 2-4. Then, try to answer the following questions (you may need to search through the rest of the paper for some of these):

  1. Is this an observational or experimental study? Observation (it even says so!)
  2. What statistical tests were used in this analysis? Are you familiar with all of them from the course? If not, what tests or statistical terms are you unfamiliar with? I found Spearman’s rank correlation and multivariate regression. There was a lot of Biological jargon but you should have understood most of the statistics.