We will start today by looking at a small dataset containing teacher salaries from 2009-2010 for 71 randomly chosen teachers employed by the St. Louis Public School in Michigan.

`teachers <- read_csv("https://statsmaths.github.io/stat_data/teachers_pay.csv")`

```
## Parsed with column specification:
## cols(
## base = col_integer(),
## degree = col_character(),
## years = col_double()
## )
```

The available variables are

- base: Base annual salary, in dollars
- degree: Highest educational degree attained: BA (Bachelor’s) or MA (Master’s)
- years: Number of years employed

Using the `mean` function, what is the average base pay of all teachers in the dataset?

`mean(teachers$base)`

`## [1] 56937.61`

Fit a model for the mean of the base pay variable using `lm_basic`. Save the model as an object called `model`:

`model <- lm_basic(base ~ 1, data = teachers)`

Using a call to `reg_table`, find the mean implied by the model:

`reg_table(model)`

```
##
## Call:
## lm_basic(formula = base ~ 1, data = teachers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21511 -5764 2976 8422 11292
##
## Coefficients:
## Estimate
## (Intercept) 56938
##
## Residual standard error: 9029 on 69 degrees of freedom
```

Does the mean implied by the model agree with the average you computed with the `mean` function?

**Answer**: Yes.
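The agreement is not a coincidence: a model with only an intercept (`base ~ 1`) has the sample mean as its least-squares estimate. The following is a minimal sketch in base R. It assumes that `lm` handles this formula the same way `lm_basic` does, and it uses made-up salary values rather than the teachers data.

```r
# Hypothetical salaries, for illustration only (not the teachers data)
salaries <- c(41000, 52500, 58750, 61200, 66900)

# Intercept-only model: the single coefficient is the fitted mean
fit <- lm(salaries ~ 1)
intercept <- unname(coef(fit)[1])

# Both lines print the same value
print(intercept)
print(mean(salaries))
```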

Add a 95% confidence interval to the regression table.

`reg_table(model, level = 0.95)`

```
##
## Call:
## lm_basic(formula = base ~ 1, data = teachers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21511 -5764 2976 8422 11292
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 56938 54785 59091
##
## Residual standard error: 9029 on 69 degrees of freedom
```

What is the range of mean salaries implied by the confidence interval?

**Answer**: The interval runs from 54785 to 59091 dollars.
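The bounds in the table can be reproduced by hand from the estimate and the residual standard error. The sketch below assumes that `reg_table` reports the usual t-based interval for the mean. The numbers are copied from the output above, so rounding in the printed residual standard error may shift the result by a dollar or so.

```r
# Values copied from the output above
estimate <- 56937.61  # mean base salary
rse      <- 9029      # residual standard error (rounded in the table)
df       <- 69        # residual degrees of freedom

n      <- df + 1          # an intercept-only model uses one degree of freedom
se     <- rse / sqrt(n)   # standard error of the mean
t_crit <- qt(0.975, df)   # two-sided 95% critical value

lower <- estimate - t_crit * se
upper <- estimate + t_crit * se
print(round(c(lower, upper)))
```

The interval is narrow compared with the spread of the residuals because the standard error divides the residual standard error by the square root of the sample size.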

Draw a histogram of the base salary values for the entire dataset.

```
ggplot(teachers, aes(base)) +
  geom_histogram(color = "black", fill = "white", bins = 20)
```

Do most of the salary values fall within the range given by the 95% confidence interval for the mean? Why or why not?

**Answer**: No, because the confidence interval is trying to capture the mean, not the data.
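A small simulation makes the distinction concrete. This is a sketch with simulated salaries, not the teachers data. The sample size, mean, and spread are assumptions chosen to loosely resemble the table above, and `t.test` from base R stands in for `lm_basic`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated salaries (assumed mean and spread, for illustration only)
salaries <- rnorm(70, mean = 57000, sd = 9000)

# 95% confidence interval for the mean
ci <- t.test(salaries, conf.level = 0.95)$conf.int

# Share of individual salaries that fall inside that interval
inside <- mean(salaries >= ci[1] & salaries <= ci[2])
print(inside)
```

Only a minority of the simulated values land inside the interval: its width shrinks as the sample grows, while the spread of the individual salaries does not.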

Use the `filter` command to construct a new dataset called `masters` consisting of just those teachers with a master’s degree.

`masters <- filter(teachers, degree == "MA")`

Compute a 95% confidence interval for the mean pay of teachers with a master’s degree. Does this range intersect the 95% confidence interval you found for all teachers?

```
model <- lm_basic(base ~ 1, data = masters)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = base ~ 1, data = masters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16098.7 -5673.2 958.3 5698.8 10436.3
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 57794 54890 60697
##
## Residual standard error: 7915 on 30 degrees of freedom
```

**Answer**: The range intersects, but is not identical to, the range from the model with all teachers.
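The comparison can also be checked numerically. The sketch below copies the interval endpoints from the two regression tables above. The variable names are illustrative only.

```r
# Interval endpoints copied from the two regression tables above
all_teachers <- c(lower = 54785, upper = 59091)
masters_only <- c(lower = 54890, upper = 60697)

# Two closed intervals overlap when each one starts before the other ends
overlaps <- all_teachers[["lower"]] <= masters_only[["upper"]] &&
  masters_only[["lower"]] <= all_teachers[["upper"]]

# The shared range runs from the larger lower bound to the smaller upper bound
shared <- c(max(all_teachers[["lower"]], masters_only[["lower"]]),
            min(all_teachers[["upper"]], masters_only[["upper"]]))

print(overlaps)
print(shared)
```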

Now load the following dataset containing all murders that occurred in London from 1 January 2006 to 7 September 2011.

`london <- read_csv("https://statsmaths.github.io/stat_data/london_murders.csv")`

```
## Parsed with column specification:
## cols(
## age = col_integer(),
## year = col_integer(),
## borough = col_character()
## )
```

The available variables are:

- age: age of the victim in years
- year: year of the murder
- borough: the London borough in which the murder took place

Find an 80% confidence interval for the average age of the victim of a murder in London.

```
model <- lm_basic(age ~ 1, data = london)
reg_table(model, level = 0.8)
```

```
##
## Call:
## lm_basic(formula = age ~ 1, data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.858 -12.858 -3.858 10.142 66.142
##
## Coefficients:
## Estimate 10 % 90 %
## (Intercept) 33.86 33.07 34.65
##
## Residual standard error: 17.85 on 837 degrees of freedom
```

Make sure you actually extract the answer here:

**Answer**: The interval is from 33.07 to 34.65 years.

Describe in words how the preceding confidence interval should be interpreted.

**Answer**: We are using a procedure that, if applied to multiple datasets, would capture the true mean 80% of the time. This procedure found a range from 33.07 to 34.65 years.
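The "80% of the time" claim can be illustrated by simulation. This is a sketch under stated assumptions: the population mean, the standard deviation, and the sample size are hypothetical, the ages are drawn from a normal distribution rather than from the London data, and `t.test` from base R stands in for `lm_basic`.

```r
set.seed(42)  # arbitrary seed for reproducibility

true_mean  <- 34    # hypothetical population mean age
n_datasets <- 2000  # number of simulated datasets

# For each simulated dataset, build an 80% interval and record
# whether it contains the true mean
captured <- replicate(n_datasets, {
  ages <- rnorm(100, mean = true_mean, sd = 18)
  ci <- t.test(ages, conf.level = 0.8)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(captured)
print(coverage)
```

The printed proportion should land close to 0.8. Any single interval either contains the true mean or does not; the 80% describes how often the procedure succeeds across datasets.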