Tutorial 15: Statistical Models

In this tutorial we learn how to build inferential statistical models using the statsmodels module. Start by loading the module as well as pandas, matplotlib, and iplot.

In [1]:
%matplotlib inline
import matplotlib as mpl
import pandas as pd
import statsmodels.formula.api as smf

import iplot

assert iplot.__version__ >= 1
Loading BokehJS ...
In [2]:
mpl.rcParams['figure.figsize'] = (20.0, 10.0)

These notes give, in just a few minutes, a quick overview of basic inferential statistics, material that would typically take 3-4 weeks in an introductory statistics course. Since that course is an informal prerequisite here, I assume you have all seen this material before; this is offered as a refresher and does not get too far into the details.

Regression

Linear models are arguably the most well-known and most often used methods for modeling data. They are employed to model the outcomes of patients in clinical trials, the price of financial instruments, the lifetimes of fruit flies, and many other responses from a wide range of fields. Their popularity is not unwarranted; in fact, the discussion of linear models and their variants takes up a considerable portion of these notes.

Consider observing $n$ pairs of data $(x_i, y_i)$. A simple linear model would assume that the data are generated according to the equation

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$

where $\epsilon_i$ is some unobserved error term and the $\beta_j$'s are unknown constants. Geometrically, this is a line with a specific intercept and slope. The goal of statistical modeling is to use the observed data to, in some fashion, estimate the parameters $\beta_0$ and $\beta_1$.
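
To make the model concrete, here is a minimal sketch (assuming numpy is available) that simulates data from this equation with made-up parameter values $\beta_0 = 2$ and $\beta_1 = 0.5$:

import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 10, size=100)    # observed predictor values
eps = np.random.normal(0, 1, size=100)    # unobserved error terms
y = 2.0 + 0.5 * x + eps                   # beta_0 = 2.0, beta_1 = 0.5 (made up)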

Let's read in a small dataset to show visually how this works. This is another classic example often used in statistics courses; it gives fuel-economy and other measurements for certain makes and models of cars. (Note: I need to create a new name for the variable 'class' because class is a reserved keyword in Python.)

In [3]:
df = pd.read_csv("https://statsmaths.github.io/stat_data/mpg.csv")
df['class_x'] = df['class']
df.head()
Out[3]:
manufacturer model displ year cyl trans drv cty hwy fl class class_x
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact compact

Let's draw a scatter plot of the variable hwy (miles per gallon on the highway) and cty (miles per gallon in city road conditions).

In [4]:
p = iplot.create_figure(df, 'hwy', 'cty')
iplot.show(p)

Notice that the data appear to follow a linear trend: as one variable increases, so does the other. Let's try to put a line segment directly on the plot (I found this line by trial and error).

In [5]:
p.segment(12, 9, 45, 32, color='red')
iplot.show(p)

Trial and error is not a great way to figure out the best line to run through a plot. Instead, we want a precise algorithm that optimizes some metric. The most common choice is the ordinary least squares (OLS) method, which finds the slope and intercept that minimize the sum of squared residuals (the amounts by which the line misses the observed y-values):

$$ \widehat{\beta} = \text{argmin}_b \left\{ \sum_i \left( y_i - b_0 - b_1 \cdot x_i \right)^2 \right\} $$
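
For a single predictor this minimization has a closed-form solution: the slope is the sample covariance of $x$ and $y$ divided by the sample variance of $x$, and the intercept forces the line through the point of means. A minimal sketch, assuming numpy and two equal-length arrays:

import numpy as np

def ols_fit(x, y):
    """Least squares intercept and slope for a single predictor."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # cov(x, y) / var(x)
    intercept = np.mean(y) - slope * np.mean(x)      # line passes through the means
    return intercept, slope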

We will use the statsmodels module to compute the ordinary least squares estimator via smf.ols. Here, we create a model that estimates the city miles per gallon variable as a function of the highway variable.

In [6]:
model = smf.ols(formula="cty ~ hwy", data=df)
model
Out[6]:
<statsmodels.regression.linear_model.OLS at 0x111cac470>

We need to actually fit the model to the data using the fit method. Printing the result shows a lot of information!

In [7]:
result = model.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cty   R-squared:                       0.914
Model:                            OLS   Adj. R-squared:                  0.913
Method:                 Least Squares   F-statistic:                     2459.
Date:                Wed, 26 Sep 2018   Prob (F-statistic):          1.87e-125
Time:                        10:22:57   Log-Likelihood:                -383.69
No. Observations:                 234   AIC:                             771.4
Df Residuals:                     232   BIC:                             778.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.8442      0.333      2.534      0.012       0.188       1.501
hwy            0.6832      0.014     49.585      0.000       0.656       0.710
==============================================================================
Omnibus:                        3.986   Durbin-Watson:                   1.093
Prob(Omnibus):                  0.136   Jarque-Bera (JB):                4.565
Skew:                           0.114   Prob(JB):                        0.102
Kurtosis:                       3.645   Cond. No.                         98.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coef column in the middle gives the estimated coefficients of the regression. Here, the model estimates that the relationship is given by:

$$ \text{cty}_i = 0.8442 + 0.6832 \cdot \text{hwy}_i + \epsilon_i $$
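
These estimated coefficients are also available programmatically through the params attribute of the result object, so we can reconstruct the fitted line by hand:

b0, b1 = result.params['Intercept'], result.params['hwy']
manual_pred = b0 + b1 * df['hwy']    # matches the output of result.predict()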

Finally, we can add the predicted values back into the pandas dataframe using the predict method and then plot them.

In [8]:
df['pred'] = result.predict()
In [9]:
p = iplot.create_figure(df, 'hwy', 'pred')
iplot.show(p)

Notice that the values all fall along a line.

We can build regression models that use multiple variables to estimate the response. To do this, add (literally, with a plus sign) variables to the right-hand side of the formula object, as seen below (displ is the engine displacement, i.e., the size of the engine):

In [10]:
model = smf.ols(formula="cty ~ hwy + displ", data=df)
result = model.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cty   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     1412.
Date:                Wed, 26 Sep 2018   Prob (F-statistic):          2.93e-130
Time:                        10:22:57   Log-Likelihood:                -368.30
No. Observations:                 234   AIC:                             742.6
Df Residuals:                     231   BIC:                             753.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.7368      0.751      6.306      0.000       3.257       6.217
hwy            0.5954      0.020     29.602      0.000       0.556       0.635
displ         -0.5283      0.093     -5.699      0.000      -0.711      -0.346
==============================================================================
Omnibus:                       18.031   Durbin-Watson:                   1.207
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.047
Skew:                           0.302   Prob(JB):                     4.49e-10
Kurtosis:                       5.012   Cond. No.                         240.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model now predicts that the relationship is given by:

$$ \text{cty}_i = 4.7368 + 0.5954 \cdot \text{hwy}_i -0.5283 \cdot \text{displ}_i + \epsilon_i $$

This is a plane rather than a line, and so harder to visualize and more difficult to reason about. But multiple regression is a powerful tool that will come up frequently in this course and in other applications across statistics and data science.
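
One payoff of the fitted model object is that we can predict city mileage for cars we have not observed. A quick sketch; the two cars below are made-up values:

new_cars = pd.DataFrame({'hwy': [30, 22], 'displ': [2.0, 3.5]})   # hypothetical cars
result.predict(new_cars)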

Inference

In the example above, we fit a linear regression to some example data, using ordinary least squares to estimate the best values of the slopes and intercept. Statistical inference goes one step further by describing how confident we are in these estimates. In order to do this, it makes several assumptions about how the data are distributed. These are:

  • the relationship between the response $y_i$ and the predictor variable is correctly described by a linear regression
  • the errors $\epsilon_i$ are independent and identically distributed
  • the errors $\epsilon_i$ do not have any extreme outliers

The second assumption implies, among other things, that all of our observations are independent.
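
A common visual check of these assumptions is to plot the residuals against the fitted values and look for patterns; under the assumptions, the points should scatter evenly around zero. A sketch using the model fit above (the column names fitted and resid are my own):

df['fitted'] = result.fittedvalues   # predictions from the fitted model
df['resid'] = result.resid           # observed cty minus fitted cty

p = iplot.create_figure(df, 'fitted', 'resid')
iplot.show(p)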

There are essentially two closely related techniques used for statistical inference:

  • confidence intervals
  • hypothesis tests

I tend to focus on confidence intervals, as they are the more useful of the two in data science applications.

Given a confidence level (such as 95%), a confidence interval is an algorithm that supplies a range of guesses for an unknown parameter that will include the correct value (assuming the assumptions are valid) with probability equal to the confidence level. In other words, if we use a 95% confidence interval in many experiments, the true parameter will be captured, on average, 19 out of every 20 times we use it.
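
To illustrate this coverage property, here is a small simulation sketch (assuming numpy; the true slope of 0.5 is made up): we repeatedly generate data with a known slope, fit the regression, and count how often the 95% interval contains the truth. The proportion should be close to 0.95.

import numpy as np

np.random.seed(1)
true_slope = 0.5                     # made-up "truth" for the simulation
hits = 0
reps = 200
for _ in range(reps):
    sim_df = pd.DataFrame({'x': np.random.uniform(0, 10, size=50)})
    sim_df['y'] = 1.0 + true_slope * sim_df['x'] + np.random.normal(0, 1, size=50)
    sim_result = smf.ols(formula="y ~ x", data=sim_df).fit()
    low, high = sim_result.conf_int().loc['x']   # 95% interval for the slope
    hits += (low <= true_slope <= high)

hits / reps   # should be close to 0.95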

Let's print out the model summary again:

In [11]:
model = smf.ols(formula="cty ~ hwy + displ", data=df)
result = model.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cty   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     1412.
Date:                Wed, 26 Sep 2018   Prob (F-statistic):          2.93e-130
Time:                        10:22:57   Log-Likelihood:                -368.30
No. Observations:                 234   AIC:                             742.6
Df Residuals:                     231   BIC:                             753.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.7368      0.751      6.306      0.000       3.257       6.217
hwy            0.5954      0.020     29.602      0.000       0.556       0.635
displ         -0.5283      0.093     -5.699      0.000      -0.711      -0.346
==============================================================================
Omnibus:                       18.031   Durbin-Watson:                   1.207
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.047
Skew:                           0.302   Prob(JB):                     4.49e-10
Kurtosis:                       5.012   Cond. No.                         240.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The values to the far right of the coefficients give the 95% confidence intervals for the intercept and slopes. For example, our best guess of the hwy slope is $0.5954$, but the confidence interval ranges from $0.556$ to $0.635$.
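
Rather than reading these off the printed table, the intervals can be extracted directly with the conf_int method:

result.conf_int()             # 95% intervals, one row per coefficient
result.conf_int(alpha=0.01)   # 99% intervals instead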

Boxplot

As a side note, you may also find it useful to draw boxplots using the pandas module. Here, the distribution of a continuous variable (its median and quartiles) is plotted against a categorical variable. The height of each box (the interquartile range) tells us how variable the measurements are within each class.

In [12]:
df.boxplot('hwy', by='class_x')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a251c87b8>