Let’s briefly look at a well-known dataset in R, distributed with the ggplot2 package, that records the fuel efficiency of a number of different types of cars. For each car, the data gives the average miles per gallon when driving in the city (lots of stops and slower speeds) and on the highway.
## # A tibble: 234 × 5
## manufacturer model year city highway
## <chr> <chr> <int> <int> <int>
## 1 audi a4 1999 18 29
## 2 audi a4 1999 21 29
## 3 audi a4 2008 20 31
## 4 audi a4 2008 21 30
## 5 audi a4 1999 16 26
## 6 audi a4 1999 18 26
## 7 audi a4 2008 18 27
## 8 audi a4 quattro 1999 18 26
## 9 audi a4 quattro 1999 16 25
## 10 audi a4 quattro 2008 20 28
## # ℹ 224 more rows
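The column names shown above (city, highway) differ from those in the copy of the data that ships with ggplot2 (cty, hwy). As a minimal sketch, assuming the table was derived from ggplot2::mpg by keeping five columns and renaming two of them, a data frame of the same shape could be built like this:

```r
# Assumption: the table above is ggplot2::mpg with five columns kept
# and cty/hwy renamed to city/highway. Requires ggplot2 to be installed.
mpg <- ggplot2::mpg[, c("manufacturer", "model", "year", "cty", "hwy")]
names(mpg)[4:5] <- c("city", "highway")

# Printing the tibble shows the first ten rows and the dimensions.
print(mpg)
```

The later examples rebuild mpg in the same way so that each one can be run on its own.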
What is the relationship between city and highway fuel efficiency? If we plot one against the other, the points follow a roughly linear pattern.
We can add a linear regression line to the scatter plot with the following:
mpg |>
  ggplot(aes(city, highway)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
If we want to learn the actual parameters of the regression model,
the slope and intercept, we need to run the lm function directly and
summarize the output. Here is what that looks like:
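As a minimal sketch, a call along the following lines should produce a summary like the one shown next. The formula is taken from the Call line of the output; the object name model is our own choice, and mpg is assumed to be the renamed copy of ggplot2::mpg.

```r
# Assumption: mpg is ggplot2::mpg with cty/hwy renamed to city/highway.
mpg <- ggplot2::mpg[, c("manufacturer", "model", "year", "cty", "hwy")]
names(mpg)[4:5] <- c("city", "highway")

# Fit highway mileage as a linear function of city mileage.
model <- lm(highway ~ city, data = mpg)

# The fitted intercept and slope as a named numeric vector.
coef(model)

# The full table of estimates, standard errors, t-statistics and p-values.
summary(model)
```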
##
## Call:
## lm(formula = highway ~ city, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3408 -1.2790 0.0214 1.0338 4.0461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89204 0.46895 1.902 0.0584 .
## city 1.33746 0.02697 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
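The table gives the point estimate, the standard error (the quantity
that is multiplied by the t-critical value to get the half-width of the
confidence interval), the t-statistic testing the null hypothesis that
the coefficient is zero, and the corresponding p-value. To get
confidence intervals directly, we can use the confint function. The
column labels give the lower and upper percentiles of the interval, so
the output below, labelled 49 % and 51 %, is a 2% interval
(level = 0.02) rather than the default 95% interval: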
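As a minimal sketch, assuming the fitted object is called model (our own name) and mpg is the renamed copy of ggplot2::mpg, the interval can be requested like this:

```r
# Assumption: mpg is ggplot2::mpg with cty/hwy renamed to city/highway.
mpg <- ggplot2::mpg[, c("manufacturer", "model", "year", "cty", "hwy")]
names(mpg)[4:5] <- c("city", "highway")
model <- lm(highway ~ city, data = mpg)

# With no level argument, confint() returns a 95% interval
# (columns labelled 2.5 % and 97.5 %).
ci_default <- confint(model)

# level = 0.02 gives columns labelled 49 % and 51 %.
ci_narrow <- confint(model, level = 0.02)
ci_narrow
```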
## 49 % 51 %
## (Intercept) 0.8802725 0.9038097
## city 1.3367787 1.3381325
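To see how the columns of the summary table fit together, we can rebuild the t-statistic, the p-value and a confidence interval by hand from the estimate and the standard error. This is only a sketch: the variable names are our own, mpg is assumed to be the renamed copy of ggplot2::mpg, and we use a 95% interval for illustration.

```r
# Assumption: mpg is ggplot2::mpg with cty/hwy renamed to city/highway.
mpg <- ggplot2::mpg[, c("manufacturer", "model", "year", "cty", "hwy")]
names(mpg)[4:5] <- c("city", "highway")
model <- lm(highway ~ city, data = mpg)

# Pull the coefficient table out of the summary as a numeric matrix.
tab <- coef(summary(model))
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
df  <- df.residual(model)  # residual degrees of freedom

# t-statistic for the null hypothesis that the coefficient is zero.
t_stat <- est / se

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% interval: estimate plus or minus t-critical value times standard error.
t_crit <- qt(0.975, df = df)
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
```

The hand-built interval should agree with confint(model), which uses the same formula.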
If we want to run a regression without an intercept, as on the
worksheet, we can do this by subtracting one in the model formula passed
to the lm function (that is, writing highway ~ city - 1):
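As a minimal sketch, with the formula taken from the Call line of the output and the object names being our own, the fit might look like this. It again assumes mpg is the renamed copy of ggplot2::mpg.

```r
# Assumption: mpg is ggplot2::mpg with cty/hwy renamed to city/highway.
mpg <- ggplot2::mpg[, c("manufacturer", "model", "year", "cty", "hwy")]
names(mpg)[4:5] <- c("city", "highway")

# "- 1" removes the intercept, forcing the line through the origin.
model_no_int <- lm(highway ~ city - 1, data = mpg)

# Writing "0 +" in the formula is an equivalent way to drop the intercept.
model_zero <- lm(highway ~ 0 + city, data = mpg)

summary(model_no_int)
```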
##
## Call:
## lm(formula = highway ~ city - 1, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8419 -1.0337 -0.1314 1.0302 4.1918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## city 1.387210 0.006626 209.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.762 on 233 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9947
## F-statistic: 4.383e+04 on 1 and 233 DF, p-value: < 2.2e-16
Running a regression without an intercept is less common than running one with an intercept. Next time we will see more ways of extending this technique, as well as how to interpret the bottom of the summary table.