Class 07: Reading Data and Statistical Tests

2017-09-19

library(readr)
library(ggplot2)
library(dplyr)
library(viridis)

Statistical Tests

Today we turn our attention to modeling, which complements the graphical and data manipulation techniques that we have so far covered.

You should all be familiar with the structure of statistical hypothesis testing and confidence intervals. If not, the following is a good source for a brief review or catch-up:

Today I will very quickly review this material in the context of the dataset we just constructed.

One-sample inference

Let’s look at the mpg data set once again.

mpg <- read_csv("https://statsmaths.github.io/stat_data/mpg.csv")

Assuming that this was constructed as a random sample from all available car models, we can construct a confidence interval for the mean city fuel efficiency of all cars on the market by using the function t.test.

t.test(mpg$cty)
## 
## 	One Sample t-test
## 
## data:  mpg$cty
## t = 60.596, df = 233, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  16.31083 17.40712
## sample estimates:
## mean of x 
##  16.85897

We see that the 95% confidence interval runs from 16.31 to 17.41 miles per gallon. If this is actually a simple random sample from some population, and the true distribution is not too highly skewed, this procedure will give a range containing the true mean about 95% of the time.
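To make this concrete, here is a small simulation (not from the original notes) that draws 1000 samples from a population whose mean we know and counts how often the interval from t.test captures that mean. The sample size, mean, and standard deviation below are arbitrary choices for illustration.

set.seed(1)
covered <- replicate(1000, {
  x <- rnorm(50, mean = 17, sd = 4)   # one sample from a known population
  ci <- t.test(x)$conf.int            # its 95% confidence interval
  ci[1] <= 17 & 17 <= ci[2]           # does the interval contain the true mean?
})
mean(covered)                         # should come out close to 0.95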

The t.test function also runs a hypothesis test for us. If the true average city fuel efficiency were 0 miles per gallon, we would observe a sample mean this large less than 2.2e-16 of the time (the p-value above). Of course, testing whether fuel efficiency is zero is ridiculous; it has to be a positive value for any functioning car. Let’s instead test how rare these results would be if the “true mean” were 16:

t.test(mpg$cty, mu = 16)
## 
## 	One Sample t-test
## 
## data:  mpg$cty
## t = 3.0874, df = 233, p-value = 0.002264
## alternative hypothesis: true mean is not equal to 16
## 95 percent confidence interval:
##  16.31083 17.40712
## sample estimates:
## mean of x 
##  16.85897

Here, we would expect a sample as extreme as this one only 0.226% of the time if the true mean were 16 miles per gallon.

Two-sample inference

We can also use the t.test function to compare the means of two different groups. For example, is there a difference between the city fuel efficiency of a randomly chosen Subaru car and a randomly chosen Toyota car? We can make two different datasets to compare this using the filter function:

subaru <- filter(mpg, manufacturer == "subaru")
toyota <- filter(mpg, manufacturer == "toyota")

We then pass both variables to the t.test function:

t.test(subaru$cty, toyota$cty)
## 
## 	Welch Two Sample t-test
## 
## data:  subaru$cty and toyota$cty
## t = 1.0279, df = 40.118, p-value = 0.3102
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7306081  2.2432131
## sample estimates:
## mean of x mean of y 
##  19.28571  18.52941

Here, the default null hypothesis is that there is no difference between the two fuel efficiencies. We see that the p-value is 0.31, meaning that roughly 31% of the time we would see results at least this extreme if there were truly no difference between the fuel efficiency of a randomly chosen Toyota automobile and a randomly chosen Subaru automobile. The confidence interval for the difference in fuel efficiencies is -0.73 to 2.24 miles per gallon. Notice that this range includes zero, indicating our lack of confidence in there being any difference at all.
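As an aside, the same Welch test can be run without building two separate datasets by using the formula interface of t.test. This is just an alternative way of writing the call above, with the two groups selected by filter:

both <- filter(mpg, manufacturer %in% c("subaru", "toyota"))
t.test(cty ~ manufacturer, data = both)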

Simple linear regression

Let’s assume that there is a linear relationship between the city fuel efficiency and highway fuel efficiency of any randomly chosen automobile. Specifically, assume that:

cty_i = alpha + beta * hwy_i + epsilon_i

Where alpha and beta are constants and epsilon_i is random noise that has mean zero, is not correlated with the highway fuel efficiency, and is independent for each sampled car. We can represent this relationship graphically with our tools at hand:

ggplot(mpg, aes(hwy, cty)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()

(Plot: city fuel efficiency against highway fuel efficiency, with the fitted regression line overlaid.)

We can model this more formally in R with the lm function:

model <- lm(cty ~ hwy, data = mpg)
model
## 
## Call:
## lm(formula = cty ~ hwy, data = mpg)
## 
## Coefficients:
## (Intercept)          hwy  
##      0.8442       0.6832

The model object gives us the estimated slope and intercept. More on this later. If we apply the summary function (not summarize), we also get t statistics and p-values, which correspond to hypothesis tests of each coefficient being equal to zero:

summary(model)
## 
## Call:
## lm(formula = cty ~ hwy, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9247 -0.7757 -0.0428  0.6965  4.6096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.84420    0.33319   2.534   0.0119 *  
## hwy          0.68322    0.01378  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.252 on 232 degrees of freedom
## Multiple R-squared:  0.9138,	Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

And the confint function gives confidence intervals for each parameter.

confint(model)
##                 2.5 %    97.5 %
## (Intercept) 0.1877296 1.5006736
## hwy         0.6560715 0.7103667
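Although we will return to prediction later, the fitted model can already be used with the predict function. The highway values below are just made-up examples:

predict(model, newdata = data.frame(hwy = c(25, 35)))   # predicted cty for hwy of 25 and 35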

Data Collection

Today’s lab has you create a small dataset and apply these modeling techniques to it. Start by opening some spreadsheet software on your machine. This can be Excel, Google Sheets, OpenOffice, or anything else that you are comfortable with.

Create a new spreadsheet. The first step in collecting data is deciding what variables are going to be collected and what we want to call them. Here, we are going to collect data about city populations. The variables we want to store are:

  • city_name: the name of the city
  • pop_guess: your guess for how many people live in the city
  • pop_actual: the actual population
  • east_coast: either “1” for cities on the East coast or “0” otherwise

The cities we are going to put into the model are:

  • Raleigh, NC
  • Washington, DC
  • New York, NY
  • Richmond, VA
  • Detroit, MI
  • Sacramento, CA
  • Aurora, CO
  • Minneapolis, MN
  • Glendale, AZ
  • Seattle, WA

For reference, Boston has 673,184 people and Norfolk, Virginia has 246,393 people. Put in your guesses and mark the first four cities as being on the East Coast. I will give you the actual values in just a moment.
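If it helps to see the target layout, the first few rows of the exported csv file might look something like this. The pop_guess numbers here are only placeholders that you would replace with your own guesses, and pop_actual is left blank until you have the actual values:

city_name,pop_guess,pop_actual,east_coast
"Raleigh, NC",400000,,1
"Washington, DC",700000,,1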

Data Export and Import

Once you have finished inputting the dataset, export the document as a csv file. I recommend saving it as “city_data.csv” and storing it on your desktop. Then, go into R and read the dataset in with the read_csv function. You may need to change the path here if you saved the file somewhere else:

cities <- read_csv("city_data.csv")

Make sure that all of the variables loaded in correctly and have the correct data type.
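One quick way to do this check, assuming you kept the name cities from the code above, is to print the data or use the glimpse function from dplyr, which lists every column along with its parsed type:

cities
glimpse(cities)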

We are going to use the data you just constructed to investigate four questions:

  • what is your prediction for the actual average population of all 150 of the most populous US cities? (this was a random sample from these 150 cities)
  • test the hypothesis that the ratio between your guess and the actual population is equal to 1 (you’ll need to make a new variable in R)
  • test whether the ratio between your guess and the actual population are different based on which coast the city is on (you’ll probably need filter here)
  • fit a regression model between the actual city population and the ratio between your guess and the city population. Do you tend to do better predicting smaller or larger cities?

There is nothing particularly tricky here. Just apply the four tests I have mentioned in the notes today.
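As a rough guide, a minimal sketch of the pattern for these analyses, assuming your data are stored in an object called cities with the variable names described above, might look like the following (adapt the names to your own file):

# (1) confidence interval for the mean actual population of the 150 cities
t.test(cities$pop_actual)

# (2) test whether the guess-to-actual ratio equals 1
cities$ratio <- cities$pop_guess / cities$pop_actual
t.test(cities$ratio, mu = 1)

# (3) compare the ratio for East Coast cities and the others
east  <- filter(cities, east_coast == 1)
other <- filter(cities, east_coast == 0)
t.test(east$ratio, other$ratio)

# (4) regress the ratio on the actual population
summary(lm(ratio ~ pop_actual, data = cities))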