# Class 07: Reading Data and Statistical Tests

### 2017-09-19

## Statistical Tests

Today we turn our attention to modeling, which complements the graphical and data manipulation techniques that we have so far covered.

You should all be familiar with the structure of statistical hypothesis testing and confidence intervals. If not, the following is a good source for a brief review or catch-up:

Today I will very quickly review this material in the context of the dataset we just constructed.

### One-sample inference

Let’s look at the mpg data set once again.

Assuming that this was constructed as a random sample from all available
car models, we could construct a confidence interval for the mean
city fuel efficiency of all cars on the market by using the function
`t.test`.
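A minimal sketch of this call, assuming `mpg` is the dataset shipped with the ggplot2 package and that city fuel efficiency is its `cty` column (the object name `city_test` is only illustrative):

```r
library(ggplot2)  # assumed source of the mpg dataset

# One-sample t-test on city fuel efficiency. By default the null
# hypothesis is a mean of 0 and the confidence level is 95%.
city_test <- t.test(mpg$cty)
city_test

# The confidence interval can also be extracted directly.
city_test$conf.int
```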

We see that the 95% confidence interval is from 16.31 to 17.41 miles per gallon. If this is actually a simple random sample from some population, and the true distribution is not too highly skewed, this procedure gives a range containing the true mean in about 95% of repeated samples.

The `t.test` function also runs a hypothesis test for us. If the
true average value of city fuel efficiency were 0 miles per gallon,
we would observe a sample mean this large less than 2.2e-16 of the
time (the p-value above). Of course, testing for fuel efficiency
being zero is ridiculous. It has to be a positive value for any
functioning car. Let’s instead test for how rare these
results are if the “true mean” was 16:
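One way to run this test is through the `mu` argument of `t.test`, which sets the mean under the null hypothesis. This is a sketch under the same assumption as before, namely that `mpg` comes from ggplot2 and `cty` holds city fuel efficiency; `mean16_test` is an illustrative name:

```r
library(ggplot2)  # assumed source of the mpg dataset

# Test the null hypothesis that the true mean city fuel efficiency is 16.
mean16_test <- t.test(mpg$cty, mu = 16)
mean16_test

# The two-sided p-value as a number rather than printed output.
mean16_test$p.value
```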

Here, we would expect a sample at least as *extreme* as this one only 0.226% of
the time if the true mean were 16 miles per gallon.

### Two-sample inference

We can also use the `t.test` function to compare the means of
two different groups. For example, is there a difference between
the city fuel efficiency of a randomly chosen Subaru car and a randomly
chosen Toyota car? We can make two different datasets to compare
this using our `filter` function:
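A sketch of the two subsets, assuming the ggplot2 version of `mpg`, where the make is stored in lower case in the `manufacturer` column, and using `filter` from dplyr. The names `subaru` and `toyota` are illustrative:

```r
library(ggplot2)  # assumed source of the mpg dataset
library(dplyr)    # provides filter()

# Keep only the rows for each manufacturer.
subaru <- filter(mpg, manufacturer == "subaru")
toyota <- filter(mpg, manufacturer == "toyota")

# Check how many car models ended up in each group.
nrow(subaru)
nrow(toyota)
```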

We then pass both variables to the `t.test` function:
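A self-contained sketch of the comparison, again assuming the ggplot2 `mpg` data with its `manufacturer` and `cty` columns. Note that `t.test` with two samples runs Welch's test by default, which does not assume equal variances:

```r
library(ggplot2)  # assumed source of the mpg dataset
library(dplyr)    # provides filter()

subaru <- filter(mpg, manufacturer == "subaru")
toyota <- filter(mpg, manufacturer == "toyota")

# Two-sample t-test for a difference in mean city fuel efficiency.
# The null hypothesis is that the difference in means is zero.
two_sample <- t.test(subaru$cty, toyota$cty)
two_sample

# Confidence interval for (Subaru mean - Toyota mean).
two_sample$conf.int
```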

Here, the default null hypothesis is that there is no difference between the two fuel efficiencies. We see that the p-value is 0.31, so 31% of the time we would get results at least this extreme if there were no difference between the mean fuel efficiency of a randomly chosen Toyota automobile and a randomly chosen Subaru automobile. The confidence interval for the difference in fuel efficiencies is -0.73 to 2.24 miles per gallon. Notice that this range includes zero, indicating our lack of confidence in there being any difference at all.

### Simple linear regression

Let’s assume that there is a linear relationship between the city fuel efficiency and highway fuel efficiency of any randomly chosen automobile. Specifically, assume that:

$$ \text{city}_i = \alpha + \beta \cdot \text{highway}_i + \epsilon_i $$

where $\alpha$ and $\beta$ are constants and $\epsilon_i$ is some random noise that has a zero mean, is not correlated with the highway fuel efficiency, and is independent for each sampled car. We can represent this relationship graphically with our tools at hand:
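One possible plot, assuming the ggplot2 `mpg` data with highway and city fuel efficiency in the `hwy` and `cty` columns. Adding a least-squares line with `geom_smooth` is a choice made for this sketch:

```r
library(ggplot2)

# Scatter plot of city against highway fuel efficiency, with a
# straight least-squares line drawn over the points.
p <- ggplot(mpg, aes(x = hwy, y = cty)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```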

We can model this more formally in R with the `lm` function:
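A minimal sketch of the fit, assuming the ggplot2 `mpg` data and treating `cty` as the response and `hwy` as the predictor; `model` is an illustrative name:

```r
library(ggplot2)  # assumed source of the mpg dataset

# Fit city = alpha + beta * highway by least squares.
model <- lm(cty ~ hwy, data = mpg)
model

# The estimated intercept (alpha) and slope (beta).
coef(model)
```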

The model object gives us the estimated slope and intercept. More
on this later. If we apply the `summary` function (not `summarize`),
we will get t-statistics and p-values, which test the hypothesis that
each coefficient is equal to zero:
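A sketch of that call, refitting the same assumed model (`cty ~ hwy` on the ggplot2 `mpg` data) so the block runs on its own:

```r
library(ggplot2)  # assumed source of the mpg dataset

model <- lm(cty ~ hwy, data = mpg)

# Full regression summary: estimates, standard errors,
# t-statistics, and p-values for each coefficient.
model_summary <- summary(model)
model_summary

# Just the coefficient table, as a numeric matrix.
coef(model_summary)
```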

And the `confint` function gives confidence intervals for each
parameter.
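For example, under the same assumptions as above (ggplot2 `mpg` data, model `cty ~ hwy`), the intervals could be computed as follows; the default level is 95% and can be changed with the `level` argument:

```r
library(ggplot2)  # assumed source of the mpg dataset

model <- lm(cty ~ hwy, data = mpg)

# 95% confidence intervals for the intercept and the slope.
intervals <- confint(model)
intervals

# A different confidence level, here 99%.
confint(model, level = 0.99)
```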

## Data Collection

Today’s lab has you create a small dataset and apply these modelling techniques to it. Start by opening some spreadsheet software on your machine. This can be Excel, Google Sheets, OpenOffice, or anything else that you are comfortable with.

Create a new spreadsheet. The first step in collecting data is deciding what variables are going to be collected and what we want to call them. Here, we are going to collect data about city populations. The variables we want to store are:

- **city_name**: the name of the city
- **pop_guess**: your guess for how many people live in the city
- **pop_actual**: the actual population
- **east_coast**: either “1” for cities on the East coast or “0” otherwise

The cities we are going to put into the model are:

- Raleigh, NC
- Washington, DC
- New York, NY
- Richmond, VA
- Detroit, MI
- Sacramento, CA
- Aurora, CO
- Minneapolis, MN
- Glendale, AZ
- Seattle, WA

For reference, Boston has 673,184 people and Norfolk, Virginia has 246,393 people. Put in your guesses and mark the first four cities as being on the East Coast. I will give you the actual values in just a moment.

## Data Export and Import

Once you have finished inputting the dataset, export the document as a csv
file. I recommend saving it as “city_data.csv” and storing it on your
desktop. Then, go into R and read the dataset in with the `read_csv`
function.
You may need to change the path if you saved the file somewhere else:
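A sketch of the import step, assuming the file was saved as `city_data.csv` on the desktop and that `~/Desktop` is where your desktop lives (this varies by operating system). The names `csv_path` and `cities` are illustrative, and the `file.exists` guard is there only so the block runs without error when the file is somewhere else:

```r
library(readr)  # provides read_csv()

# Assumed location of the exported file; edit this to match your machine.
csv_path <- "~/Desktop/city_data.csv"

if (file.exists(csv_path)) {
  cities <- read_csv(csv_path)
  # Print the data along with the type of each column.
  print(cities)
} else {
  message("No file found at ", csv_path, "; update csv_path.")
}
```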

Make sure that all of the variables loaded in correctly and have the correct data type.

We are going to use the data you just constructed to investigate four questions:

- *what is your prediction for the actual average population of all 150 of the most populous US cities* (this was a random sample from these 150 cities)
- *test the hypothesis that the ratio between your guess and the actual population is equal to 1* (you’ll need to make a new variable in R)
- *test whether the ratio between your guess and the actual population differs based on which coast the city is on* (you’ll probably need `filter` here)
- *fit a regression model between the actual city population and the ratio between your guess and the city population*: do you tend to do better predicting small or larger cities?

There is nothing particularly tricky here. Just apply the four tests I have mentioned in the notes today.