{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tutorial 15: Statistical Models\n",
"\n",
"In this tutorial we learn how to build inferential statistical models\n",
"using the `statsmodels` module. Start by loading the module as well as\n",
"pandas, matplotlib, and iplot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib as mpl\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"import iplot\n",
"\n",
"assert iplot.__version__ >= 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mpl.rcParams['figure.figsize'] = (20.0, 10.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These notes give, in just a few minutes, a quick overview of basic inferential\n",
"statistics that would typically cover 3-4 weeks in an introduction to statistics\n",
"course. I assume, as it is given as an informal prerequisite for this course, that\n",
"you have all seen this material before and so this is just given as a refresher\n",
"without getting too much into the details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regression\n",
"\n",
"Linear models are arguably the most well known and often used\n",
"methods for modeling data. They are employed to model the\n",
"outcomes of patients in clinical trials, the price of financial\n",
"instruments, the lifetimes of fruit flies, and many other\n",
"responses from a wide range of fields. Their popularity is not\n",
"unwarranted. In fact, the discussion of linear models and their\n",
"variants take up a considerable portion of this text.\n",
"\n",
"Consider observing $n$ pairs of data $(x_i, y_i)$. A simple linear\n",
"model would assume that the data are generated according to the\n",
"equation\n",
"\n",
"$$ y_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i $$\n",
"\n",
"Where $\\epsilon_i$ is some unobserved error term and the $\\beta_j$'s\n",
"are unknown constants. Geometrically, this is a line with a specific\n",
"intercept and slope. The goal of statistical modeling is to use\n",
"the observed data to, in some fashion, estimate the parameters\n",
"$\\beta_0$ and $\\beta_1$. \n",
"\n",
"Let's read in a small dataset to show visually how this works. This is\n",
"another classic example often used in statitics courses; it gives values\n",
"about certain makes and models of cars. (Note: I need to create a new\n",
"name for the variable 'class' because it conflicts with a keyword in\n",
"Python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://statsmaths.github.io/stat_data/mpg.csv\")\n",
"df['class_x'] = df['class']\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's draw a scatter plot of the variable `hwy` (miles per gallon\n",
"on the highway) and `cty` (miles per gallon in city road conditions)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = iplot.create_figure(df, 'hwy', 'cty')\n",
"iplot.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the data appear to follow a linear trend: as one variable increases so does\n",
"the other. Let's try to put a line segement directly on the plot (I found the best line\n",
"by trial and error)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p.segment(12, 9, 45, 32, color='red')\n",
"iplot.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trial and error is not a great way to figure out the best line to run through a plot.\n",
"Instead, we want an precise algorithm that attempts to optimize some metric. The most\n",
"common example is the ordinary least squares (OLS) method; this finds the slope and\n",
"intercept that minimizes the squared residuals (amount the line misses the y-variable):\n",
"\n",
"$$ \\widehat{\\beta} = \\text{argmin}_b \\left\\{ \\sum_i \\left( y_i - b_0 - b_1 \\cdot x_i \\right)^2 \\right\\} $$\n",
"\n",
"We will use the `statsmodels` module to detect the ordinary least squares estimator\n",
"using `smf.ols`. Here, create a model that predicts a line estimating the city miles\n",
"per gallon variable as a function of the highway variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = smf.ols(formula=\"cty ~ hwy\", data=df)\n",
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to actually fit the model to the data using the `fit` method. Printing\n",
"the result shows a lot of information!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = model.fit()\n",
"print(result.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `coef` column in the middle gives the predicted values of the regression. Here,\n",
"the model predicts that the relationship is given by:\n",
"\n",
"$$ \\text{cty}_i = 0.8442 + 0.6832 \\cdot \\text{hwy}_i + \\epsilon_i $$\n",
"\n",
"Finally, we can add the predicted values back into the pandas dataframe using the\n",
"`predict` method and then show the predicted values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['pred'] = result.predict()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = iplot.create_figure(df, 'hwy', 'pred')\n",
"iplot.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the values all fall along a line.\n",
"\n",
"We can build regression models that use multiple variables to estimate the response.\n",
"To do this, add (literally) variables to the right hand side of the formula object\n",
"as seen below (displacement is the size of the engine):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = smf.ols(formula=\"cty ~ hwy + displ\", data=df)\n",
"result = model.fit()\n",
"print(result.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model now predicts that the relationship is given by:\n",
"\n",
"$$ \\text{cty}_i = 4.7368 + 0.5954 \\cdot \\text{hwy}_i -0.5283 \\cdot \\text{displ}_i + \\epsilon_i $$\n",
"\n",
"This is a plane rather than a line, and harder to visualize. It is also more difficult to reason\n",
"about. But, multiple regressions are powerful tools and will come up frequently in this course and\n",
"in other applications that you see across statistics and data science."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inference\n",
"\n",
"In the above example, we tried to fit a linear regression to some example data.\n",
"Ordinary least squares was used to estimate the best value of the slopes and \n",
"intercept. Statistical inference goes one step further by describing how confident\n",
"we are in our guesses. In order to do this, it makes several assumptions about \n",
"how the data are distributed. These are:\n",
"\n",
"- the relationship between the response $y_i$ and the predictor variable is \n",
"correctly described by a linear regression\n",
"- the errors $e_i$ are independent and identically distributed\n",
"- the errors $e_i$ do not have any extreme outliers\n",
"\n",
"The second assumption, amongst other things, assumes that all of our observations\n",
"are independent. \n",
"\n",
"There are essentially two techniques used for statistical inference. They are\n",
"closely related:\n",
"\n",
"- confidence intervals\n",
"- hypothesis tests\n",
"\n",
"I tend to focus on confidence intervals as they are most useful in data science\n",
"applications.\n",
"\n",
"Given a confidence level (such as 95%), a confidence interval is an algorithm\n",
"that supplies a range of guesses for an unknown parameters that will include the\n",
"correct value (assuming the assumptions are valid) with probability equal to the\n",
"confidence level. In other words, if we use a 95% confidence interval in many\n",
"experiments, the `true` parameters will be captured on average 19 out of every\n",
"20 times we use it.\n",
"\n",
"Let's print out the model summary again:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = smf.ols(formula=\"cty ~ hwy + displ\", data=df)\n",
"result = model.fit()\n",
"print(result.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values to the far right of the coefficents give the 95% confidence intervals\n",
"for the intercept and slopes. For example, our best guess of the hwy slope is\n",
"$0.5954$, but the confidence interval ranges from $0.556$ to $0.635$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boxplot\n",
"\n",
"As a somewhat side element to the notes here, you may also find it useful to\n",
"draw boxplots using the pandas module. Here, the median (middle) value of a\n",
"continuous variable is plotted again a categorical variable. The width of the\n",
"box tells us how variable the measurments are within each class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.boxplot('hwy', by='class_x', )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}