Class 05: Buying Property in The Golden State
California Housing Prices
Today we are going to look at a dataset of housing prices from California. Loading it in we see that each row of the dataset is a small neighborhood (technically a census tract) and the goal is to predict the median house price within the tract.
Looking at the dataset we see a number of potentially useful covariates. Our interest right now focuses on just the mean household income, latitude, and longitude of the neighborhood. The lab for today will ask you to extend this analysis using the remaining variables.
In the prior two classes we have introduced the notion of linear models. We have even extended this idea to multivariate linear models and linear models with categorical variables. What about relationships that are non-linear? It may seem that the linear model is highly restrictive, perhaps even to the point of uselessness on larger, more complex datasets.
The catch is that linear models need only be linear in the unknown parameters. That is, the following is not a linear model:
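As one illustrative case (the notation is assumed here: $y_i$ is the response, $x_i$ the predictor, $\epsilon_i$ the error, and the $\beta$'s the unknown parameters), a model with an unknown parameter in the exponent is not linear in its parameters:

```latex
\[
y_i = \beta_0 + \beta_1 x_i^{\beta_2} + \epsilon_i
\]
```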
However, this is:
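An illustrative example, using assumed notation ($y_i$ the response, $x_i$ the predictor, $\epsilon_i$ the error): a quadratic in $x_i$ is non-linear in the predictor but still linear in the unknown $\beta$'s, so least squares applies directly.

```latex
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i
\]
```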
By picking large enough polynomials, in theory we can approximate any relationship between X and Y.
Let’s take a look at the relationship between mean household income and median house value. I’ll put a best fit line through the data to visually see how well it performs:
The line does a decent job, but seems to overpredict house values in areas with very high mean household incomes.
Let’s remove the method = "lm" argument from the geom_smooth layer and see what a non-linear fit looks like:
This line, in contrast, seems to fit the data much more accurately.
We will see later today the exact algorithm used to produce this curve. For now, let’s use the polynomial expansion trick we just introduced to fit the curve using lm. We will fit a cubic polynomial to the data. The first step is to create two new variables:
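A minimal sketch of that step, using simulated data rather than the class dataset. The data frame `ca` and the column names `mean_household_income`, `median_house_value`, `income_2`, and `income_3` are hypothetical stand-ins, not the names used in class:

```r
# Simulated stand-in for the housing data; names and values are hypothetical.
set.seed(1)
ca <- data.frame(mean_household_income = runif(200, min = 20, max = 200))
ca$median_house_value <- 50 + 4 * ca$mean_household_income -
  0.01 * ca$mean_household_income^2 + rnorm(200, sd = 20)

# The two new variables: squared and cubed versions of the predictor.
ca$income_2 <- ca$mean_household_income^2
ca$income_3 <- ca$mean_household_income^3

head(ca)
```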
In order to compare how well a cubic polynomial fits the data, let’s evaluate how well the mean model performs:
And the linear model:
Now, we’ll build a cubic model to fit to the data and see how well it seems to fit visually.
Notice that our polynomial now predicts a non-linear relationship between the mean household income and the median house value. The predictions are slightly better than the linear model, though not by very much as the linear model performed fairly well already:
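The comparison between the mean, linear, and cubic models can be sketched as follows. This uses simulated data and in-sample RMSE; the variable names and the helper `rmse` are assumptions made for illustration, so the numbers will not match the class dataset:

```r
# Simulated data; all names and values here are hypothetical.
set.seed(1)
n <- 500
income <- runif(n, min = 20, max = 200)
value <- 50 + 4 * income - 0.01 * income^2 + rnorm(n, sd = 20)
ca <- data.frame(income, value, income_2 = income^2, income_3 = income^3)

# Root mean squared error between observed and predicted values.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

fit_mean   <- lm(value ~ 1, data = ca)                            # mean model
fit_linear <- lm(value ~ income, data = ca)                       # straight line
fit_cubic  <- lm(value ~ income + income_2 + income_3, data = ca) # cubic

results <- c(
  mean   = rmse(ca$value, fitted(fit_mean)),
  linear = rmse(ca$value, fitted(fit_linear)),
  cubic  = rmse(ca$value, fitted(fit_cubic))
)
print(results)
```

Because the three models are nested, the in-sample RMSE cannot increase from one to the next; how large the improvement is depends on the data.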
Basis expansion with poly()
The technique of creating more complex relationships by explicitly adding transformed versions of the original variables is called basis expansion. It is a very powerful and useful tool.
We can make use of polynomial expansions without having to directly construct all of the polynomial terms by using the poly function. For example, here we use it to fit the cubic polynomial again:
Note that the predicted values are the same as before; however, the exact polynomial terms are actually different. R has picked a polynomial basis which is much more numerically stable than the naive one we used. We will explore this more when we dive into R matrices next week.
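A small sketch on simulated data (variable names are assumptions) illustrates both points: the naive cubic and the `poly` version give the same fitted values, while the coefficients differ because `poly` uses an orthogonal basis by default.

```r
# Simulated data; names and values are hypothetical.
set.seed(1)
x <- runif(300, min = 20, max = 200)
y <- 50 + 4 * x - 0.01 * x^2 + rnorm(300, sd = 20)

fit_naive <- lm(y ~ x + I(x^2) + I(x^3))  # raw powers of x
fit_poly  <- lm(y ~ poly(x, 3))           # orthogonal polynomial basis

# Largest difference between the two sets of fitted values.
max_diff <- max(abs(fitted(fit_naive) - fitted(fit_poly)))
print(max_diff)

# The coefficients are not comparable term by term.
print(coef(fit_naive))
print(coef(fit_poly))

# The columns of the poly() basis are orthonormal, which is what
# makes the fit better conditioned than raw powers.
B <- poly(x, 3)
print(round(crossprod(B), 10))
```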
With this new function it is easy to expand the polynomial basis to a large order. Try for example, 15:
Notice that we get edge effects, where the polynomial acts very strangely for extreme values of the X variable. The fit here is marginally better than the cubic fit:
With large polynomials, we are in a sense approximating the non-linear regression model given by the following:
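In assumed notation ($y_i$ the response, $x_i$ the predictor, $\epsilon_i$ the error), such a model can be written as:

```latex
\[
y_i = f(x_i) + \epsilon_i
\]
```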
Where the goal is to estimate the function f. Here we are doing so by approximating it by a large polynomial. A natural extension of non-linear regression is the additive model. Here there is a non-linear relationship between the predictor variables and the response, but there are no interaction terms between the two predictor variables:
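With two predictors, written here as $x_i$ and $z_i$ (assumed notation, with $g$ a second unknown function), an additive model takes the form:

```latex
\[
y_i = f(x_i) + g(z_i) + \epsilon_i
\]
```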
We can fit models like this by the same polynomial trick. Here, let’s use it to predict median house value as a function of the mean and median household incomes:
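A minimal sketch of an additive polynomial fit, using simulated data. The column names `mean_income`, `median_income`, and `value` are hypothetical, and the cubic degree is chosen only for illustration:

```r
# Simulated data; names and values are hypothetical.
set.seed(1)
n <- 500
mean_income   <- runif(n, min = 20, max = 200)
median_income <- 0.8 * mean_income + rnorm(n, sd = 10)
value <- 50 + 4 * mean_income - 0.01 * mean_income^2 +
  0.5 * median_income + rnorm(n, sd = 20)
ca <- data.frame(mean_income, median_income, value)

# One polynomial per predictor, added together: no interaction terms.
fit_add <- lm(value ~ poly(mean_income, 3) + poly(median_income, 3), data = ca)

# Intercept plus three terms for each of the two predictors.
print(names(coef(fit_add)))
```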
Again, we see minor improvements with the increased complexity of the model.
A more general form of the additive model would allow arbitrary interactions between the predictor variables. We would write this as the following:
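In assumed notation, with predictors $x_i$ and $z_i$ and a single unknown function $f$ of both:

```latex
\[
y_i = f(x_i, z_i) + \epsilon_i
\]
```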
To approximate this with a polynomial, we would need to include all interaction terms as well. For example, here is the quadratic approximation:
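Using assumed notation (predictors $x_i$ and $z_i$, unknown coefficients $\beta_0, \ldots, \beta_5$), a quadratic approximation with its single cross term $x_i z_i$ can be written as:

```latex
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i^2 + \beta_4 z_i^2
      + \beta_5 x_i z_i + \epsilon_i
\]
```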
In the quadratic case, the number of terms grows only by 1. For higher orders the number of interactions increases significantly faster.
To fit such models in R, we simply use the poly function and give it multiple variables. Here is the cubic interaction of mean and median household income:
The numbers in the coefficients give the degree of the first term, a dot, and the degree of the second term. Once again, this more complex model offers some (modest) gains over the simpler models.
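The naming scheme can be seen in a small sketch on simulated data (the column names are hypothetical). Passing two variables to `poly` with `degree = 3` produces every term whose total degree is at most three:

```r
# Simulated data; names and values are hypothetical.
set.seed(1)
n <- 500
mean_income   <- runif(n, min = 20, max = 200)
median_income <- 0.8 * mean_income + rnorm(n, sd = 10)
value <- 50 + 4 * mean_income - 0.01 * mean_income^2 +
  0.5 * median_income + rnorm(n, sd = 20)
ca <- data.frame(mean_income, median_income, value)

# Cubic polynomial in both variables, including interaction terms.
fit_int <- lm(value ~ poly(mean_income, median_income, degree = 3), data = ca)

# Each name ends in "a.b": degree a in the first variable, b in the second.
print(names(coef(fit_int)))
```

A name ending in `1.1`, for example, is the product of the degree-one terms of the two variables.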
And, as we have seen, this increases the predictiveness of the model by a slight amount.
As a general pattern, we see that increasing the complexity of the model creates a more predictive output. This should not be taken as a given. Let’s increase the degree of the interaction polynomial to 6, then 10, then 12. Notice what happens with the training and validation errors:
We see that the training error slowly gets lower as we increase the degree of the polynomial. With more terms to use, it will always improve the fit on the data it is trained on (at least up to numerical stability). However, the higher order fits begin to overfit to the training data. For degree six polynomials this creates a slightly worse fit. For degree 10 the fit is significantly worse. At degree 12, the validation RMSE is many times higher than the training error.
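The mechanics of this comparison can be sketched on simulated data. This is a simplified, one-variable version (not the interaction polynomials used above), and the data, the split, and the helper `rmse` are all assumptions made for illustration:

```r
# Simulated data with a smooth non-linear signal; values are hypothetical.
set.seed(1)
n <- 400
df <- data.frame(x = runif(n, min = -1, max = 1))
df$y <- sin(3 * df$x) + rnorm(n, sd = 0.3)

# Random half/half split into training and validation sets.
train_id <- sample(n, size = n / 2)
train <- df[train_id, ]
valid <- df[-train_id, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Fit on the training set only; evaluate on both sets.
degrees <- c(1, 3, 6, 10, 12)
out <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train  = rmse(train$y, fitted(fit)),
    valid  = rmse(valid$y, predict(fit, newdata = valid)))
}))
print(out)
```

The training column cannot increase as the degree grows, since each model contains the previous one. Whether and where the validation column turns upward depends on the data and the split.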
As a general rule, we use the validation set to determine exactly how complex to make the model by determining which model minimizes the RMSE on the validation set. Visually, we see the following relationship (though in our case we do not know the test set and are comparing with the validation set):
I have several videos that demonstrate the process of moving from over-fitting to under-fitting. Here are three examples:
While we have not yet seen these particular models, the basic ideas behind over-fitting and under-fitting still apply.
Latitude and Longitude
By looking at how predictive the models are, we can clearly see that the relationship between mean and median income and median house values involves non-linearities and interactions. Let’s move on, however, to a more obvious example where the relationship should be non-linear and need interactions: using longitude and latitude to predict house value.
A simple plot shows how using longitude and latitude creates an ad hoc map.
If you know California well, you may be able to infer a lot from this plot. Otherwise, layering a map under the data helps substantially. Using the ggmap package, I can replace the qplot function with the qmplot function to reproduce the prior plot with a map.
As we might expect, prices jump near Los Angeles and San Francisco and drop as you move inland.
Let’s fit a model to the data using the
And see what this plot looks like over the map:
As hoped, our model lights up near the two big cities and decreases as we move inland.
Other basis expansions
There are other alternatives to using a polynomial basis. One of my favorites is to do a discrete transformation of the dataset using the
cut function. Here, we break every mean household income into ten
A linear model fit on this variable (we use the
to tell R to treat it as a categorical variable), is able to fit
any arbitrary number to each bin resulting in fits that can be
bin function in the smodels package breaks the
variable into a number of bins as well. However, here the bins are required to contain the same number of points rather than being the same width:
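The two binning schemes can be sketched in base R on simulated data. The equal-count version below uses `quantile` with `cut` as a stand-in; it is not the smodels implementation, and all variable names are hypothetical:

```r
# Simulated, right-skewed predictor; names and values are hypothetical.
set.seed(1)
n <- 500
income <- rlnorm(n, meanlog = 4, sdlog = 0.5)
value <- 50 + 40 * log(income) + rnorm(n, sd = 10)

# Equal-width bins: cut() returns a factor, so lm() treats it as categorical.
bin_width <- droplevels(cut(income, breaks = 10))
fit_cut <- lm(value ~ bin_width)

# With a single categorical predictor, each fitted value is its bin's mean.
bin_means <- tapply(value, bin_width, mean)
gap <- max(abs(fitted(fit_cut) - bin_means[as.character(bin_width)]))
print(gap)

# Equal-count bins: place the break points at the deciles instead.
breaks_q <- quantile(income, probs = seq(0, 1, length.out = 11))
bin_count <- cut(income, breaks = breaks_q, include.lowest = TRUE)
print(table(bin_width))  # counts vary a lot across equal-width bins
print(table(bin_count))  # counts are balanced across decile bins
```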
There is not a direct function for computing two dimensional cuts or bins, but we can create these directly without much trouble. See for example the following code to bin the latitude and longitude:
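One way to build such a two-dimensional grid is sketched here with base R's `cut` and `interaction` on simulated coordinates. This is an assumed approach for illustration and may differ from the code used in class; the names and the 5-by-5 grid are hypothetical choices:

```r
# Simulated coordinates roughly spanning California; values are hypothetical.
set.seed(1)
n <- 1000
latitude  <- runif(n, min = 32.5, max = 42)
longitude <- runif(n, min = -124.4, max = -114.1)
value <- 200 - 15 * abs(longitude + 122) + rnorm(n, sd = 10)

# Bin each coordinate separately, then cross the two factors so that
# every occupied grid cell becomes its own category.
lat_bin  <- cut(latitude, breaks = 5)
lon_bin  <- cut(longitude, breaks = 5)
grid_bin <- interaction(lat_bin, lon_bin, drop = TRUE)

# The linear model assigns one predicted value (the cell mean) per cell.
fit_grid <- lm(value ~ grid_bin)
print(nlevels(grid_bin))
```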
The resulting plot yields a different prediction for each bin:
Linear smoothers with gam
We conclude with a brief introduction to the gam package. The package allows us to replace the lm function with the gam function.
With no other options selected, this will not affect the results in any way. The benefit comes from the included functions s, which stands for smoothing spline, and
lo, which stands for LOcal polynomial smoothers. We can use these functions as a drop-in replacement for the poly function. However, unlike poly, we do not need to specify the degree of the polynomial, as the complexity of the model will be determined by the data:
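A minimal sketch of this usage on simulated data, assuming the gam package is installed (the code is skipped otherwise). The data frame and column names are hypothetical:

```r
# Runs only if the gam package is available.
if (requireNamespace("gam", quietly = TRUE)) {
  library(gam)

  # Simulated data; names and values are hypothetical.
  set.seed(1)
  n <- 300
  df <- data.frame(income = runif(n, min = 20, max = 200))
  df$value <- 50 + 60 * log(df$income) + rnorm(n, sd = 15)

  # Smoothing spline and local regression in place of poly(income, k).
  fit_s  <- gam(value ~ s(income), data = df)
  fit_lo <- gam(value ~ lo(income), data = df)

  # In-sample RMSE for each smoother.
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  print(c(s = rmse(df$value, fitted(fit_s)),
          lo = rmse(df$value, fitted(fit_lo))))
}
```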
We can add two smoothed variables together much as we did with the poly function:
The lo function, but not the s function, also allows for multivariate effects. So, we can smooth over interactions between latitude and longitude:
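A sketch of a two-variable smooth on simulated coordinates, assuming the gam package is installed (the code is skipped otherwise). The data and names are hypothetical:

```r
# Runs only if the gam package is available.
if (requireNamespace("gam", quietly = TRUE)) {
  library(gam)

  # Simulated coordinates roughly spanning California; values are hypothetical.
  set.seed(1)
  n <- 300
  geo <- data.frame(latitude  = runif(n, min = 32.5, max = 42),
                    longitude = runif(n, min = -124.4, max = -114.1))
  geo$value <- 200 - 15 * abs(geo$longitude + 122) -
    5 * abs(geo$latitude - 36) + rnorm(n, sd = 10)

  # A single local-regression surface over both coordinates at once.
  fit_geo <- gam(value ~ lo(latitude, longitude), data = geo)
  print(head(fitted(fit_geo)))
}
```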
Notice that the
lo functions are doing more than intelligently selecting
the degree of the polynomial. There are some downsides to using the
compared to polynomial basis expansion with the lm function. The primary downside is that the computational time can be quite long with larger datasets, particularly if we try to fit interaction terms. In fact, just the last block of code took a non-negligible amount of time to run and we only have a few thousand rows of data and two variables. Secondly, interaction terms can become unstable in dimensions three or more. Often, a low-order polynomial expansion can be used up to even 4 or 5 dimensions if there is sufficient data.