Diamonds Data - Simple Regression

Today, we will start by looking at a dataset of diamond prices:

diamonds <- read_csv("https://statsmaths.github.io/stat_data/diamonds.csv")
## Parsed with column specification:
## cols(
##   carat = col_double(),
##   cut = col_character(),
##   color = col_character(),
##   clarity = col_character(),
##   depth = col_double(),
##   table = col_double(),
##   price = col_integer(),
##   x = col_double(),
##   y = col_double(),
##   z = col_double()
## )

Fit a linear regression predicting the diamond price as a function of its weight (carats). Print out the regression table without any confidence intervals.

model <- lm_basic(price ~ carat, data = diamonds)
reg_table(model)
## 
## Call:
## lm_basic(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate
## (Intercept)    -2256
## carat           7756
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

How would you interpret the slope coefficient in this model?

Answer: The change in price due to a one carat increase in the size of the diamond.

Next, show the same regression as a scatter plot with a smoothing line:

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_bestfit()

Now, print the same model as in the previous question but add a 95% confidence interval.

model <- lm_basic(price ~ carat, data = diamonds)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    -2256 -2282  -2231
## carat           7756  7729   7784
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

What range of values does the confidence interval give for the intercept in this model?

Answer: from -2282 to -2231

What range of values does the confidence interval give for the slope in this model?

Answer: from 7729 to 7784

Would you be surprised if we collected another set of diamonds, in exactly the same way as before, and found a slope estimate equal to 5500?

Answer: Yes, because it is not in the current confidence interval.

Would you be surprised if we collected another set of diamonds, in exactly the same way as before, and found a slope estimate equal to 7500?

Answer: No, because it is in the current confidence interval.

The biggest use of confidence intervals, at least for us, is to determine whether the confidence interval for a slope contains only values that are all negative, all positive, or a mix of both.

Why might this be important? Let’s look at our diamond example. Describe the meaning of the slope in this linear regression:

Answer: The change in price due to a one carat increase in the size of the diamond.

Therefore, if the confidence interval has only positive values, we can fairly confidently say is that there is a positive relationship between diamond weight and its price.

Now, fit a regression model on the diamonds dataset, predicting the price as a function of depth. Compute the regression table using level equal to .95:

model <- lm_basic(price ~ depth, data = diamonds)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = price ~ depth, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3766  -2986  -1521   1396  14937 
## 
## Coefficients:
##             Estimate   2.5 %  97.5 %
## (Intercept)  5763.67 4312.17 7215.16
## depth         -29.65  -53.15   -6.15
## 
## Residual standard error: 3989 on 53938 degrees of freedom
## Multiple R-squared:  0.0001134,  Adjusted R-squared:  9.483e-05 
## F-statistic: 6.115 on 1 and 53938 DF,  p-value: 0.0134

What does the model predict will be the price of a diamond with a depth of 60?

Answer: 5763.67 + (-29.65) * 60 = 3984.67

Describe in words the estimate of the slope in terms of the dataset.

Answer: Its the increase (actually, decrease) in price for every extra mm of depth that the diamond has.

Using the regression table from the previous question, does the confidence interval contain only values that are positive, only values that are negative, or values that are both positive and negative?

Answer: It contains only negative values.

Interpret this in words (your words, not just the formal definition).

Answer: There is strong statistical evidence for a negative relationship between diamond depth and price.

Diamonds Data - Multiple Linear Regression

While we lose the nice graphical summary, it is possible to build linear models with more than one variable and an intercept. In the diamond example, for instance, we can fit a model that has both depth and carat in it:

model <- lm_basic(price ~ 1 + carat + depth, data = diamonds)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = price ~ 1 + carat + depth, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18238.9   -801.6    -19.6    546.3  12683.7 
## 
## Coefficients:
##             Estimate  2.5 %  97.5 %
## (Intercept)   4045.3 3484.4 4606.30
## carat         7765.1 7737.7 7792.60
## depth         -102.2 -111.3  -93.08
## 
## Residual standard error: 1542 on 53937 degrees of freedom
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8507 
## F-statistic: 1.536e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

Notice that switching the order of the inputs does not change the estimates or confidence intervals; only the order in the output is modified, but this is inconsequential:

model <- lm_basic(price ~ 1 + depth + carat, data = diamonds)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = price ~ 1 + depth + carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18238.9   -801.6    -19.6    546.3  12683.7 
## 
## Coefficients:
##             Estimate  2.5 %  97.5 %
## (Intercept)   4045.3 3484.4 4606.30
## depth         -102.2 -111.3  -93.08
## carat         7765.1 7737.7 7792.60
## 
## Residual standard error: 1542 on 53937 degrees of freedom
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8507 
## F-statistic: 1.536e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

Using the model in the previous question, what do you expect will be the price of a diamond with a depth of 50 that weights 2 carats?

Answer: 4045.3 + (-102.2) * 50 + (7765.1) * 2 = 1.4465510^{4}

Are both of the slopes in the model statistically significant? What are their signs?

Answer: Yes. The carats variable is positive and significant. The depth variable is negative and significant.

Use the add_prediction function to add predictions from this model back into the diamonds dataset.

diamonds <- add_prediction(diamonds, model)
max(diamonds$model_resid)
## [1] 12683.65

What is the largest positive residual (you can either use the max function or sort the data in the data viewer)?

Answer: The largest postitive residual is equal to 12731.68.

Interpreting the meaning of the slopes in a regression model with multiple variables changes slightly from the version with a single slope. Each slope measures the change in the response give a change in its corresponding value, when holding the other variables in the model fixed. This last part is actually important and we will see some examples that at first seem counterintuitive that make sense once we remember this caveat.

Describe the meaning the slopes, in your own words, in the regression model above.

Answer: Its the expected change in price for a change in a diamonds depth, with the weight of the diamond held constant. Or, for the second coefficent, the expected change in the price of a diamond as a function of its weight with the depth variable held fixed.

Notice that we can do some comparisons quickly with the slope estimates. For example, if we have two diamonds with the same depth where one is 1 carat in weight and the other is 2 carats in weight, how much more do you expect the heavier diamond to cost?

Answer: The difference is just the carat variable slope, 7765.1.