Let’s briefly look at a well-known dataset, included with the ggplot2 package for R, that records the fuel efficiency of a number of different types of cars. For each car, the data gives the average miles per gallon for city driving (lots of stops and slower speeds) and for highway driving.

library(tidyverse)  # provides select() and ggplot(), used below
mpg <- select(ggplot2::mpg, manufacturer, model, year, city = cty, highway = hwy)
mpg
## # A tibble: 234 × 5
##    manufacturer model       year  city highway
##    <chr>        <chr>      <int> <int>   <int>
##  1 audi         a4          1999    18      29
##  2 audi         a4          1999    21      29
##  3 audi         a4          2008    20      31
##  4 audi         a4          2008    21      30
##  5 audi         a4          1999    16      26
##  6 audi         a4          1999    18      26
##  7 audi         a4          2008    18      27
##  8 audi         a4 quattro  1999    18      26
##  9 audi         a4 quattro  1999    16      25
## 10 audi         a4 quattro  2008    20      28
## # ℹ 224 more rows

What is the relationship between city and highway fuel efficiency? If we plot one against the other, we see that the points follow a roughly linear pattern:

mpg |>
  ggplot(aes(city, highway)) +
    geom_point()

We can overlay a linear regression line on the plot with geom_smooth:

mpg |>
  ggplot(aes(city, highway)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

If we want to learn the actual parameters of the regression model, the slope and intercept, we need to run the lm function directly and summarize the output. Here is what that looks like:

model <- lm(highway ~ city, data = mpg)
summary(model)
## 
## Call:
## lm(formula = highway ~ city, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## city         1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16
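
As a minimal sketch of how the fitted slope and intercept can be used once they are known, the block below pulls them out with coef and computes a predicted highway value by hand, then compares it with predict. The names cars and fit and the city value of 20 are illustrative choices, not part of the notes, and the block assumes the ggplot2 package is installed so that its mpg data is available.

```r
# Rebuild the two columns directly so this block stands alone
# (assumption: ggplot2 is installed; `cars` and `fit` are illustrative names).
cars <- data.frame(city = ggplot2::mpg$cty, highway = ggplot2::mpg$hwy)
fit <- lm(highway ~ city, data = cars)

# coef() returns a named vector holding the intercept and the slope.
params <- coef(fit)
intercept <- params[["(Intercept)"]]
slope <- params[["city"]]

# Predicted highway mpg for a car that gets 20 mpg in the city,
# first by hand and then with predict().
by_hand <- intercept + slope * 20
by_predict <- predict(fit, newdata = data.frame(city = 20))

by_hand
by_predict
```

The two printed values should agree, since predict simply evaluates the fitted line at the new city value.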

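The coefficient table gives, for each parameter, the point estimate, the standard error (the quantity that is multiplied by the t critical value to get the half-width of a confidence interval), the t-statistic for testing the null hypothesis that the parameter is zero, and the corresponding p-value. To get confidence intervals directly, we can use the confint function. Note that its level argument is the confidence level itself (the default is 0.95), not the significance level: the call below uses level = 0.02, so it returns a very narrow 2% interval whose endpoints are labeled 49 % and 51 %. A conventional interval would use a value such as level = 0.95.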

confint(model, level = 0.02)
##                  49 %      51 %
## (Intercept) 0.8802725 0.9038097
## city        1.3367787 1.3381325
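
To make the link between the coefficient table and the interval explicit, here is a hedged sketch that rebuilds the t values, the p-values, and a 95% confidence interval from the estimates and standard errors alone, and then compares the interval with confint. The names cars, fit, and manual_ci are illustrative, the 95% level is chosen only as a familiar example, and the block assumes the ggplot2 package is installed.

```r
# Refit the model from the raw columns so this block stands alone
# (assumption: ggplot2 is installed; all object names here are illustrative).
cars <- data.frame(city = ggplot2::mpg$cty, highway = ggplot2::mpg$hwy)
fit <- lm(highway ~ city, data = cars)

# Estimate and standard error columns of the coefficient table.
coefs <- coef(summary(fit))
est <- coefs[, "Estimate"]
se <- coefs[, "Std. Error"]

# t value = estimate / standard error; the two-sided p-value comes from
# a t distribution with the residual degrees of freedom.
t_value <- est / se
p_value <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

# 95% interval: estimate plus or minus t critical value times standard error.
t_crit <- qt(0.975, df = df.residual(fit))
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

manual_ci
confint(fit, level = 0.95)
```

The hand-built interval should match the confint output, which shows that the standard error and the t critical value are the only ingredients needed beyond the point estimate.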

If we want to run a regression without an intercept, as on the worksheet, we can do so by adding - 1 to the right-hand side of the formula passed to the lm function:

model <- lm(highway ~ city - 1, data = mpg)
summary(model)
## 
## Call:
## lm(formula = highway ~ city - 1, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8419 -1.0337 -0.1314  1.0302  4.1918 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## city 1.387210   0.006626   209.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.762 on 233 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9947 
## F-statistic: 4.383e+04 on 1 and 233 DF,  p-value: < 2.2e-16
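
For a single predictor with no intercept, the least-squares slope has the closed form sum(x * y) / sum(x^2). The sketch below checks that formula against lm, and also shows the equivalent 0 + city spelling of the formula. The names cars, slope_closed_form, fit_minus_one, and fit_zero are illustrative, and the block assumes the ggplot2 package is installed.

```r
# Rebuild the two columns so this block stands alone
# (assumption: ggplot2 is installed; all object names here are illustrative).
cars <- data.frame(city = ggplot2::mpg$cty, highway = ggplot2::mpg$hwy)

# Closed-form least-squares slope for a line through the origin.
slope_closed_form <- sum(cars$city * cars$highway) / sum(cars$city^2)

# Two equivalent ways to drop the intercept in an lm formula.
fit_minus_one <- lm(highway ~ city - 1, data = cars)
fit_zero <- lm(highway ~ 0 + city, data = cars)

slope_closed_form
coef(fit_minus_one)
coef(fit_zero)
```

All three printed slopes should be the same number.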

Running a regression without an intercept is less common than running one with an intercept. Next time we will see more ways of extending this technique, as well as how to interpret the bottom of the summary table.