## Teacher Salary

We will start today by looking at a small dataset containing teacher salaries from 2009-2010 for 71 randomly chosen teachers employed by the St. Louis Public School in Michigan.

teachers <- read_csv("https://statsmaths.github.io/stat_data/teachers_pay.csv")
## Parsed with column specification:
## cols(
##   base = col_integer(),
##   degree = col_character(),
##   years = col_double()
## )

The available variables are:

• base: Base annual salary, in dollars
• degree: Highest educational degree attained: BA (Bachelor’s) or MA (Master’s)
• years: Number of years employed

Using the mean function, what is the average base pay of all teachers in the dataset?

mean(teachers$base)
## [1] 56937.61

Fit a model for the mean of the base pay variable using lm_basic. Save the model as an object called “model”:

model <- lm_basic(base ~ 1, data = teachers)

Using a call to reg_table, find the mean implied by the model:

reg_table(model)
##
## Call:
## lm_basic(formula = base ~ 1, data = teachers)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -21511  -5764   2976   8422  11292
##
## Coefficients:
##             Estimate
## (Intercept)    56938
##
## Residual standard error: 9029 on 69 degrees of freedom
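To see why the intercept of this model is the mean, here is a minimal sketch using base R's `lm` on a small made-up data frame. It assumes that `lm_basic` fits the same least-squares model as `lm`; the `salaries` data frame and its values are hypothetical and are not part of the teachers dataset.

```r
# Hypothetical salaries for illustration only (not the teachers data)
salaries <- data.frame(base = c(41000, 52500, 56000, 61250, 67000))

# The formula `base ~ 1` has no predictors, only an intercept
fit <- lm(base ~ 1, data = salaries)

# Pull out the fitted intercept and compare it with the sample mean
intercept <- unname(coef(fit)["(Intercept)"])
sample_mean <- mean(salaries$base)

print(intercept)
print(sample_mean)
```

With no predictors, least squares chooses the single number that minimises the squared residuals, and that number is the sample mean.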

Add a 95% confidence interval to the regression table.

reg_table(model, level = 0.95)
##
## Call:
## lm_basic(formula = base ~ 1, data = teachers)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -21511  -5764   2976   8422  11292
##
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    56938 54785  59091
##
## Residual standard error: 9029 on 69 degrees of freedom

What is the range of mean salaries implied by the confidence interval?
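The interval in the table can also be rebuilt by hand from the numbers printed above. The sketch below assumes that `reg_table` reports the usual t-based interval for a mean (estimate plus or minus a t critical value times the standard error); that is an assumption about the function, not something the output states.

```r
# Values read off the regression table above
estimate <- 56938   # intercept, i.e. the sample mean
resid_se <- 9029    # residual standard error
df <- 69            # residual degrees of freedom

# For an intercept-only model, the sample size is df + 1
n <- df + 1

# Standard error of the mean and the two-sided 95% critical value
std_error <- resid_se / sqrt(n)
t_crit <- qt(0.975, df)

# Lower and upper ends of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci))
```

Up to rounding of the printed inputs, the result should land on the 2.5 % and 97.5 % columns of the table.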

Draw a histogram of the base salary values for the entire dataset.

ggplot(teachers, aes(base)) +
  geom_histogram(color = "black", fill = "white", bins = 20)

Do most of the salary values fall within the range given in question 5? Why or why not?

Answer: No, because the confidence interval is trying to capture the mean, not the data.
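One way to check this claim is to count how many observations actually fall inside the interval. The sketch below uses simulated salaries rather than the teachers data; the sample size, mean, and standard deviation are hypothetical values chosen only to be on a similar scale.

```r
set.seed(1)

# Simulated salaries (hypothetical, not the teachers data)
base <- rnorm(70, mean = 57000, sd = 9000)

# Intercept-only model and its 95% confidence interval for the mean
fit <- lm(base ~ 1)
ci <- confint(fit, level = 0.95)   # 1 x 2 matrix: lower and upper bound

# Proportion of individual observations lying inside the interval
inside <- base >= ci[1, 1] & base <= ci[1, 2]
frac_inside <- mean(inside)
print(frac_inside)
```

The interval narrows as the sample grows, while the spread of the individual salaries does not, so only a minority of observations should land inside it.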

Use the filter command to construct a new dataset called masters consisting of just those teachers with a master’s degree.

masters <- filter(teachers, degree == "MA")

Compute a 95% confidence interval for the mean pay of teachers with a master’s degree. Does this range intersect the one you had in question 5?

model <- lm_basic(base ~ 1, data = masters)
reg_table(model, level = 0.95)
##
## Call:
## lm_basic(formula = base ~ 1, data = masters)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -16098.7  -5673.2    958.3   5698.8  10436.3
##
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    57794 54890  60697
##
## Residual standard error: 7915 on 30 degrees of freedom

Answer: The range intersects, but is not identical to, the range in the model with all teachers.
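The comparison can be made explicit with a small helper. The function name `intervals_overlap` is hypothetical; the endpoints are copied from the two regression tables above.

```r
# Interval endpoints copied from the two regression tables
all_teachers <- c(lower = 54785, upper = 59091)
masters_only <- c(lower = 54890, upper = 60697)

# Two intervals intersect when each one starts before the other ends
intervals_overlap <- function(a, b) {
  a[["lower"]] <= b[["upper"]] && b[["lower"]] <= a[["upper"]]
}

overlap <- intervals_overlap(all_teachers, masters_only)

# The shared region runs from the larger lower bound to the smaller upper bound
shared <- c(max(all_teachers[["lower"]], masters_only[["lower"]]),
            min(all_teachers[["upper"]], masters_only[["upper"]]))

print(overlap)
print(shared)
```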

## Murder Data

Now load the following dataset containing all murders that occurred in London from 1 January 2006 to 7 September 2011.

london <- read_csv("https://statsmaths.github.io/stat_data/london_murders.csv")
## Parsed with column specification:
## cols(
##   age = col_integer(),
##   year = col_integer(),
##   borough = col_character()
## )

The available variables are:

• age: age of the victim in years
• year: year of the murder
• borough: the London borough in which the murder took place

Find an 80% confidence interval for the average age of the victim of a murder in London.

model <- lm_basic(age ~ 1, data = london)
reg_table(model, level = 0.8)
##
## Call:
## lm_basic(formula = age ~ 1, data = london)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -32.858 -12.858  -3.858  10.142  66.142
##
## Coefficients:
##             Estimate  10 %  90 %
## (Intercept)    33.86 33.07 34.65
##
## Residual standard error: 17.85 on 837 degrees of freedom

Make sure you actually extract the answer here:

Answer: The interval is from 33.07 to 34.65 years.

Describe in words how the preceding confidence interval should be interpreted.

Answer: We are using a procedure that, if applied to many datasets collected in the same way, would capture the true mean about 80% of the time. This procedure found a range from 33.07 to 34.65 years.
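This interpretation can be checked with a small simulation. The sketch below draws many samples from a population whose mean we know, builds an 80% t-based interval from each, and records how often the interval contains the true mean. The population values (`true_mean`, `true_sd`), the sample size, and the number of repetitions are all hypothetical choices for illustration.

```r
set.seed(42)

# Hypothetical population and simulation settings
true_mean <- 34
true_sd <- 18
n <- 100
reps <- 2000

# For each simulated dataset, check whether the 80% interval covers the true mean
covered <- replicate(reps, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  half_width <- qt(0.90, df = n - 1) * sd(x) / sqrt(n)
  abs(mean(x) - true_mean) <= half_width
})

# Share of intervals that captured the true mean
coverage <- mean(covered)
print(coverage)
```

The printed proportion should be close to 0.8. Any single interval either contains the true mean or it does not; the 80% describes the procedure across repeated samples.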