In this tutorial we learn how to build inferential statistical models using the `statsmodels` module. Start by loading the module as well as pandas, matplotlib, and iplot.

In [1]:

```
%matplotlib inline
import matplotlib as mpl
import pandas as pd
import statsmodels.formula.api as smf
import iplot
assert iplot.__version__ >= 1
```

In [2]:

```
mpl.rcParams['figure.figsize'] = (20.0, 10.0)
```

These notes give a quick overview of basic inferential statistics, material that would typically take 3-4 weeks in an introductory statistics course. I assume, since this material is an informal prerequisite for the course, that you have all seen it before, so this is just a refresher that does not get too far into the details.

Linear models are arguably the best known and most often used methods for modeling data. They are employed to model the outcomes of patients in clinical trials, the price of financial instruments, the lifetimes of fruit flies, and many other responses from a wide range of fields. Their popularity is not unwarranted. In fact, the discussion of linear models and their variants takes up a considerable portion of this text.

Consider observing $n$ pairs of data $(x_i, y_i)$. A simple linear model would assume that the data are generated according to the equation

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$

where $\epsilon_i$ is some unobserved error term and the $\beta_j$'s are unknown constants. Geometrically, this is a line with a specific intercept and slope. The goal of statistical modeling is to use the observed data to, in some fashion, estimate the parameters $\beta_0$ and $\beta_1$.
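To make this concrete, here is a small sketch (numpy only) that generates data exactly according to the equation above; the "true" values of $\beta_0$ and $\beta_1$ are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

beta_0, beta_1 = 2.0, 0.5        # assumed "true" intercept and slope
n = 100

x = rng.uniform(0, 10, size=n)   # predictor values
eps = rng.normal(0, 1, size=n)   # unobserved error terms
y = beta_0 + beta_1 * x + eps    # responses generated by the linear model

print(y[:5])
```

In real data analysis we only observe `x` and `y`; the point of estimation is to recover `beta_0` and `beta_1` without ever seeing `eps`.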

Let's read in a small dataset to show visually how this works. This is another classic example often used in statistics courses; it gives values about certain makes and models of cars. (Note: I need to create a new name for the variable 'class' because it conflicts with a keyword in Python.)

In [3]:

```
df = pd.read_csv("https://statsmaths.github.io/stat_data/mpg.csv")
df['class_x'] = df['class']
df.head()
```

Out[3]:

Let's draw a scatter plot of the variables `hwy` (miles per gallon on the highway) and `cty` (miles per gallon in city road conditions).

In [4]:

```
p = iplot.create_figure(df, 'hwy', 'cty')
iplot.show(p)
```

Notice that the data appear to follow a linear trend: as one variable increases, so does the other. Let's try to put a line segment directly on the plot (I found the best line by trial and error).

In [5]:

```
p.segment(12, 9, 45, 32, color='red')
iplot.show(p)
```

Trial and error is not a great way to figure out the best line to run through a plot. Instead, we want a precise algorithm that optimizes some metric. The most common example is the ordinary least squares (OLS) method; this finds the slope and intercept that minimize the sum of squared residuals (the amounts by which the line misses the y-values):

$$ \widehat{\beta} = \text{argmin}_b \left\{ \sum_i \left( y_i - b_0 - b_1 \cdot x_i \right)^2 \right\} $$
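With a single predictor, this minimization has a well-known closed-form solution: the slope estimate is the sample covariance of $x$ and $y$ divided by the sample variance of $x$, and the intercept follows from the sample means. A sketch on synthetic data (numpy only; the true intercept and slope here are assumed values chosen for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=200)   # true intercept 1.0, slope 2.0

# closed-form OLS: slope = cov(x, y) / var(x); intercept from the sample means
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)   # estimates should land near 1.0 and 2.0
```

Libraries like `statsmodels` carry out an equivalent computation (generalized to many predictors), which is what we turn to next.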

We will use the `statsmodels` module to compute the ordinary least squares estimator via `smf.ols`. Here, we create a model that estimates the city miles per gallon variable as a linear function of the highway variable.

In [6]:

```
model = smf.ols(formula="cty ~ hwy", data=df)
model
```

Out[6]:

We need to actually fit the model to the data using the `fit` method. Printing the result shows a lot of information!

In [7]:

```
result = model.fit()
print(result.summary())
```

The `coef` column in the middle gives the estimated coefficients of the regression. Here, the model predicts that the relationship is given by:

$$ \text{cty}_i = 0.8442 + 0.6832 \cdot \text{hwy}_i + \epsilon_i $$
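To see what this fitted equation says, we can plug a hypothetical highway value into the fitted line by hand (the error term drops out because we are predicting the average response; the value 30 below is just an example):

```python
# fitted coefficients taken from the regression summary above
b0, b1 = 0.8442, 0.6832

hwy = 30                   # a hypothetical highway mpg value
cty_hat = b0 + b1 * hwy    # predicted city mpg on the fitted line
print(round(cty_hat, 2))   # → 21.34
```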

Finally, we can add the predicted values back into the pandas DataFrame using the `predict` method and then show the predicted values.

In [8]:

```
df['pred'] = result.predict()
```

In [9]:

```
p = iplot.create_figure(df, 'hwy', 'pred')
iplot.show(p)
```

Notice that the values all fall along a line.

We can build regression models that use multiple variables to estimate the response. To do this, literally add variables to the right-hand side of the formula object as seen below (`displ`, the displacement, is the size of the engine):

In [10]:

```
model = smf.ols(formula="cty ~ hwy + displ", data=df)
result = model.fit()
print(result.summary())
```

The model now predicts that the relationship is given by:

$$ \text{cty}_i = 4.7368 + 0.5954 \cdot \text{hwy}_i -0.5283 \cdot \text{displ}_i + \epsilon_i $$
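As before, we can evaluate the fitted plane by hand for hypothetical inputs (the values 30 and 2.0 below are just example inputs, not from the dataset):

```python
# fitted coefficients taken from the regression summary above
b0, b_hwy, b_displ = 4.7368, 0.5954, -0.5283

hwy, displ = 30, 2.0       # hypothetical highway mpg and engine displacement
cty_hat = b0 + b_hwy * hwy + b_displ * displ
print(round(cty_hat, 2))   # → 21.54
```

Note that the `hwy` coefficient changed from $0.6832$ to $0.5954$ once `displ` entered the model; each slope is now interpreted as the effect of that variable holding the others fixed.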

This is a plane rather than a line, and harder to visualize. It is also more difficult to reason about. But, multiple regressions are powerful tools and will come up frequently in this course and in other applications that you see across statistics and data science.

In the above example, we tried to fit a linear regression to some example data. Ordinary least squares was used to estimate the best value of the slopes and intercept. Statistical inference goes one step further by describing how confident we are in our guesses. In order to do this, it makes several assumptions about how the data are distributed. These are:

- the relationship between the response $y_i$ and the predictor variables is correctly described by a linear regression
- the errors $\epsilon_i$ are independent and identically distributed
- the errors $\epsilon_i$ do not have any extreme outliers

The second assumption implies, amongst other things, that all of our observations are independent.

There are essentially two techniques used for statistical inference. They are closely related:

- confidence intervals
- hypothesis tests

I tend to focus on confidence intervals as they are most useful in data science applications.

Given a confidence level (such as 95%), a confidence interval is an algorithm that supplies a range of guesses for an unknown parameter that will include the correct value (assuming the assumptions are valid) with probability equal to the confidence level. In other words, if we use a 95% confidence interval in many experiments, the true parameter will be captured on average 19 out of every 20 times we use it.
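This coverage interpretation can be illustrated with a small simulation (numpy only; to keep it short, the noise standard deviation is treated as known, so the interval uses the normal quantile 1.96 rather than a t quantile):

```python
import numpy as np

rng = np.random.default_rng(1)
true_slope, sigma = 2.0, 1.0
x = np.linspace(0, 10, 50)
sxx = np.sum((x - x.mean()) ** 2)
se = sigma / np.sqrt(sxx)          # standard error of the OLS slope (sigma known)

hits = 0
n_trials = 1000
for _ in range(n_trials):
    # simulate a fresh experiment from the linear model
    y = 1.0 + true_slope * x + rng.normal(0, sigma, size=x.size)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # OLS slope estimate
    lo, hi = b1 - 1.96 * se, b1 + 1.96 * se              # 95% confidence interval
    hits += (lo <= true_slope <= hi)

print(hits / n_trials)   # the fraction of intervals that capture the true slope
```

Across many simulated experiments, roughly 95% of the intervals contain the true slope, matching the stated confidence level.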

Let's print out the model summary again:

In [11]:

```
model = smf.ols(formula="cty ~ hwy + displ", data=df)
result = model.fit()
print(result.summary())
```

The values to the far right of the coefficients give the 95% confidence intervals for the intercept and slopes. For example, our best guess of the `hwy` slope is $0.5954$, but the confidence interval ranges from $0.556$ to $0.635$.

As a somewhat side element to these notes, you may also find it useful to draw boxplots using the pandas module. Here, the distribution of a continuous variable is plotted against a categorical variable; the line in the middle of each box marks the median (middle) value, and the height of the box tells us how variable the measurements are within each class.

In [12]:

```
df.boxplot('hwy', by='class_x')
```

Out[12]: