Class 05: Buying Property in The Golden State

2017-09-12

library(readr)
library(ggplot2)
library(dplyr)

California Housing Prices

Today we are going to look at a dataset of housing prices from California. Loading it in we see that each row of the dataset is a small neighborhood (technically a census tract) and the goal is to predict the median house price within the tract.

ca <- read_csv("https://statsmaths.github.io/ml_data/ca_house_price.csv")

Looking at the dataset, we see a number of potentially useful covariates. Our interest right now focuses on just the mean household income, latitude, and longitude of the neighborhood. The lab for today will ask you to build on this analysis using the remaining variables.

Non-linear relationships

In the prior two classes we have introduced the notion of linear models. We have even extended this idea to multivariate linear models and linear models with categorical variables. What about relationships that are non-linear? It may seem that the linear model is highly restrictive, perhaps even to the point of uselessness on larger, more complex datasets.

The catch is that linear models need only be linear in the unknown parameters. That is, a model such as the following is not a linear model:

$$ y_i = \alpha + x_i^{\beta} + \epsilon_i $$

However, this is:

$$ y_i = \beta_0 + \beta_1 \cdot x_i + \beta_2 \cdot x_i^2 + \beta_3 \cdot x_i^3 + \epsilon_i $$

By picking a polynomial of high enough degree, in theory we can approximate any reasonably smooth relationship between X and Y.
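To see concretely that a polynomial in x is still a linear model in the sense that lm requires, here is a minimal sketch on simulated data (the data and object names are made up purely for illustration):

# a curve in x, but linear in the unknown coefficients, so lm() applies directly
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x - 0.3 * x^2 + 0.02 * x^3 + rnorm(100, sd = 2)
fit <- lm(y ~ x + I(x^2) + I(x^3))
coef(fit)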

Let’s take a look at the relationship between mean household income and median house value. I’ll put a best fit line through the data to visually see how well it performs:

qplot(mean_household_income, median_house_value, data = ca) +
  geom_smooth(method = "lm")

[Plot: median house value against mean household income with a linear best-fit line]

The line does a decent job, but seems to over-predict house values in areas with very high mean household incomes.

Let’s remove the method = "lm" argument from the geom_smooth layer and see what a non-linear fit looks like:

qplot(mean_household_income, median_house_value, data = ca) +
  geom_smooth()

[Plot: median house value against mean household income with a non-linear smoother]

This line, in contrast, seems to fit the data much more accurately. We will see later today the exact algorithm used by geom_smooth. For now, let’s use the polynomial expansion trick we just derived to fit the curve using lm. We will fit a cubic polynomial to the data. The first step is to create two new variables:

ca$mean_household_income_2 <- ca$mean_household_income^2
ca$mean_household_income_3 <- ca$mean_household_income^3

To have a baseline against which to compare the cubic polynomial, let’s first evaluate how well the mean model (predicting the overall average value for every tract) performs:

avg <- mean(ca$median_house_value, na.rm = TRUE)
# RMSE of the mean model, computed separately for each part of the train/valid/test split
sqrt(tapply((ca$median_house_value - avg)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 198343.4 203487.0

And the linear model:

model <- lm(median_house_value ~ mean_household_income,
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 138582.8 139739.5

Now we’ll build a cubic model, fit it to the training data, and see how well it appears to fit visually.

model <- lm(median_house_value ~ mean_household_income +
              mean_household_income_2 + mean_household_income_3,
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
qplot(mean_household_income, value_pred, data = ca)

[Plot: predicted values from the cubic polynomial model plotted against mean household income]

Notice that our polynomial now predicts a non-linear relationship between the mean household income and the median house value. The predictions are slightly better than those from the linear model, though not by very much, as the linear model performed fairly well already:

sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 137440.3 138633.9

Basis expansion with poly()

The technique of creating more complex relationships by explicitly adding transformed versions of the original variables is called basis expansion. It is a very powerful and useful tool.
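Polynomials are only one choice of basis; any transformed version of a variable can serve as a new column. As a quick aside (a sketch, not part of the original analysis; it assumes mean household income is strictly positive), here is the same idea with a logarithm term:

# basis expansion with a log-transformed term instead of polynomial powers
model_log <- lm(median_house_value ~ log(mean_household_income),
                subset = train_id == "train",
                data = ca)
ca$value_pred_log <- predict(model_log, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred_log)^2,
            ca$train_id, mean))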

Using the poly function, we can make use of polynomial expansions without having to construct each of the polynomial terms by hand. For example, here we use it to fit the cubic polynomial again:

model <- lm(median_house_value ~
              poly(mean_household_income, degree = 3),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
qplot(mean_household_income, value_pred, data = ca)

[Plot: predicted values from the poly() cubic model plotted against mean household income]

Note that the predicted values are the same as before; however, the exact polynomial terms are actually different. R has picked a polynomial basis that is much more numerically stable than the naive one we used. We will explore this more when we dive into R matrices next week.
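As a quick sanity check (a sketch, not part of the original analysis; the object names are just for illustration), the orthogonal basis chosen by poly and the raw powers, requested with raw = TRUE, give the same fitted values even though their individual coefficients differ:

# same fitted values from the orthogonal and raw polynomial bases
fit_orth <- lm(median_house_value ~ poly(mean_household_income, degree = 3),
               subset = train_id == "train", data = ca)
fit_raw  <- lm(median_house_value ~ poly(mean_household_income, degree = 3, raw = TRUE),
               subset = train_id == "train", data = ca)
max(abs(fitted(fit_orth) - fitted(fit_raw)))  # essentially zero, up to rounding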

With this new function it is easy to expand the polynomial basis to a high order. Try, for example, degree 15:

model <- lm(median_house_value ~ poly(mean_household_income, 15),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
qplot(mean_household_income, value_pred, data = ca)

[Plot: predicted values from the degree-15 polynomial plotted against mean household income]

Notice that we get edge effects, where the polynomial acts very strangely for extreme values of the X variable. The fit here is only marginally better than the cubic fit:

sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 137251.2 138451.0

Additive models

With large polynomials, we are in a sense approximating the non-linear regression model given by the following:

$$ y_i = f(x_i) + \epsilon_i $$

Where the goal is to estimate the function f. Here we are doing so by approximating f with a large polynomial. A natural extension of non-linear regression is the additive model. Here there is a non-linear relationship between each predictor variable and the response, but there are no interaction terms between the two predictor variables:

$$ y_i = f_1(x_{i,1}) + f_2(x_{i,2}) + \epsilon_i $$

We can fit models like this using the same polynomial trick. Here, let’s use it to predict median house value as a function of the mean and median household incomes:

model <- lm(median_house_value ~ poly(mean_household_income, 3) +
              poly(median_household_income, 3),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 134635.4 135243.2

Again, we see minor improvements with the increased complexity of the model.

Non-linear interactions

A more general form of the additive model would allow arbitrary interactions between the predictor variables. We would write this as the following:

$$ y_i = f(x_{i,1}, x_{i,2}) + \epsilon_i $$

To approximate this with a polynomial, we would need to include all of the interaction terms as well. For example, here is the quadratic approximation:

$$ y_i = \beta_0 + \beta_1 \cdot x_{i,1} + \beta_2 \cdot x_{i,2} + \beta_3 \cdot x_{i,1}^2 + \beta_4 \cdot x_{i,2}^2 + \beta_5 \cdot x_{i,1} \cdot x_{i,2} + \epsilon_i $$

In the quadratic case, the single cross term adds only one coefficient beyond the additive quadratic model. For higher orders the number of interaction terms grows much faster.
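To see how quickly the terms accumulate, note that a full polynomial of total degree d in two variables has (d + 1)(d + 2) / 2 monomials, counting the intercept; the cubic interaction model fit below, for example, has 10 coefficients. A quick computation:

# number of monomials of total degree <= d in two variables, intercept included
d <- 1:12
data.frame(degree = d, n_terms = choose(d + 2, 2))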

To fit such models in R, we simply use the poly function and give it multiple variables. Here is the cubic interaction of mean and median household income:

model <- lm(median_house_value ~
               poly(mean_household_income, median_household_income,
                    degree = 3),
            subset = train_id == "train",
            data = ca)
model
## 
## Call:
## lm(formula = median_house_value ~ poly(mean_household_income, 
##     median_household_income, degree = 3), data = ca, subset = train_id == 
##     "train")
## 
## Coefficients:
##                                                         (Intercept)  
##                                                              317644  
## poly(mean_household_income, median_household_income, degree = 3)1.0  
##                                                            12988874  
## poly(mean_household_income, median_household_income, degree = 3)2.0  
##                                                            -9567618  
## poly(mean_household_income, median_household_income, degree = 3)3.0  
##                                                            -1618524  
## poly(mean_household_income, median_household_income, degree = 3)0.1  
##                                                            -6387193  
## poly(mean_household_income, median_household_income, degree = 3)1.1  
##                                                           817671980  
## poly(mean_household_income, median_household_income, degree = 3)2.1  
##                                                           234873162  
## poly(mean_household_income, median_household_income, degree = 3)0.2  
##                                                            -1002996  
## poly(mean_household_income, median_household_income, degree = 3)1.2  
##                                                          -422189888  
## poly(mean_household_income, median_household_income, degree = 3)0.3  
##                                                             4147020

The suffixes on the coefficient names give the degree of the first variable, a dot, and then the degree of the second variable. Once again, this more complex model offers some (modest) gains over the simpler models.

ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 134068.0 135014.9

And, as we have seen, this increases the predictiveness of the model by a slight amount.

Overfitting

As a general pattern, we see that increasing the complexity of the model creates a more predictive output. This should not be taken as a given, however. Let’s increase the degree of the interaction polynomial to 6, then 10, and then 12. Notice what happens to the training and validation errors:

model <- lm(median_house_value ~
               poly(mean_household_income, median_household_income,
                    degree = 6),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 132948.1 134918.0
model <- lm(median_house_value ~
               poly(mean_household_income, median_household_income,
                    degree = 10),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 132430.0 186558.9
model <- lm(median_house_value ~
               poly(mean_household_income, median_household_income,
                    degree = 12),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 132167.4 394899.8

We see that the training error slowly decreases as we increase the degree of the polynomial. With more terms available, the model can always fit the training data at least as well (up to issues of numerical stability). However, the higher-order fits begin to over-fit the training data: at degree 6 the validation error barely changes, at degree 10 it becomes substantially worse, and at degree 12 the validation RMSE is roughly three times the training RMSE.

As a general rule, we use the validation set to decide how complex to make the model: we choose the degree that minimizes the RMSE on the validation set. As model complexity grows, the training error keeps falling while the validation error eventually turns back upward (in our case we do not have the test responses, so we compare against the validation set rather than the test set).
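A minimal sketch of this selection procedure (not part of the original analysis; the candidate degrees and object names are chosen just for illustration) loops over degrees and records the validation RMSE for each:

# sketch: choose the interaction polynomial degree with the lowest validation RMSE
degrees <- 1:12
valid_rmse <- rep(NA_real_, length(degrees))
for (i in seq_along(degrees)) {
  fit <- lm(median_house_value ~
              poly(mean_household_income, median_household_income,
                   degree = degrees[i]),
            subset = train_id == "train", data = ca)
  pred <- predict(fit, newdata = ca)
  sq_error <- (ca$median_house_value - pred)^2
  valid_rmse[i] <- sqrt(mean(sq_error[ca$train_id == "valid"], na.rm = TRUE))
}
degrees[which.min(valid_rmse)]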

I have several videos that demonstrate the process of moving from over-fitting to under-fitting. Here are three examples:

While we have not yet seen these particular models, the basic ideas of over-fitting and under-fitting still apply.

Latitude and Longitude

By looking at how predictive the models are, we can clearly see that the relationship between household income (mean and median) and median house value is non-linear and involves interactions. Let’s move on, however, to a more obvious example where the relationship should be non-linear and require interactions: using longitude and latitude to predict house value.

A simple plot shows how using longitude and latitude creates an ad hoc map.

library(viridis)
qplot(longitude, latitude, data = ca[ca$train_id != 'test',],
      color = median_house_value, size = I(0.5)) +
  scale_color_viridis()

[Plot: tract locations by longitude and latitude, colored by median house value]

If you know California well, you may be able to infer a lot from this plot. Otherwise, layering a map under the data helps substantially. Using the ggmap package, I can replace the qplot function with the qmplot function to reproduce the prior plot with a map.

library(ggmap)
qmplot(longitude, latitude, data = ca[ca$train_id != 'test',],
       color = median_house_value, size = I(0.5)) +
  scale_color_viridis()

[Plot: the same points layered over a map of California]

As we might expect, prices jump near Los Angeles and San Francisco and drop as you move inland.

Let’s fit a model to the data using the poly function:

model <- lm(median_house_value ~
               poly(latitude, longitude,
                    degree = 6),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))
##     test    train    valid 
##       NA 157151.3 161607.1

And see what the predicted values look like over the map:

library(ggmap)
qmplot(longitude, latitude, data = ca,
       color = value_pred, size = I(0.5)) +
  scale_color_viridis()

[Plot: predicted values from the degree-6 polynomial in latitude and longitude, shown over a map of California]

As hoped, our model lights up near the two big cities and decreases as we move inland.

Other basis expansions

There are other alternatives to using a polynomial basis. One of my favorites is a discrete transformation of the data using the cut function. Here, we break mean household income into ten equal-width buckets:

ca$mean_household_income_cut <- cut(ca$mean_household_income, 10, labels = FALSE)
qplot(mean_household_income, mean_household_income_cut, data = ca)

[Plot: the ten-bucket cut variable plotted against mean household income]

A linear model fit on this variable (we use the factor function to tell R to treat it as a categorical variable) is able to fit an arbitrary value to each bin, resulting in fits that can be very non-linear.

model <- lm(median_house_value ~ factor(mean_household_income_cut),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
qplot(mean_household_income, value_pred, data = ca)

[Plot: predicted values from the binned model plotted against mean household income]

Similarly, the bin function in the smodels package also breaks the variable into a number of bins. However, here the bins are required to contain the same number of points rather than being the same width:

library(smodels)
ca$mean_household_income_bin <- bin(ca$mean_household_income, 10)
qplot(mean_household_income, mean_household_income_bin, data = ca)

[Plot: the ten equal-count bins plotted against mean household income]

There is not a direct function for computing two dimensional cuts or bins, but we can create these directly without much trouble. See for example the following code to bin the latitude and longitude:

ca$latitude_bin <- cut(ca$latitude, 5, labels = FALSE)
ca$longitude_bin <- cut(ca$longitude, 5, labels = FALSE)
ca$lat_long_bin <- ca$latitude_bin + ca$longitude_bin * 10

Fitting a model on this combined variable yields a different prediction for each two-dimensional bin:

library(ggmap)
model <- lm(median_house_value ~  factor(lat_long_bin), data = ca)
ca$value_pred <- predict(model, newdata = ca)
qmplot(longitude, latitude, data = ca, color = value_pred) +
  scale_color_viridis()

[Plot: predicted values from the latitude/longitude bin model, shown over a map of California]

Linear smoothers with gam

We conclude with a brief introduction to the gam package. The package allows us to replace the lm function with the gam function. With no other options selected, this will not affect the results in any way. The benefit comes from the included functions s, which stands for smoothing spline, and lo, which stands for LOcal polynomial smoother. We can use these functions as drop-in replacements for the poly function. However, unlike poly, we do not need to specify the degree of the polynomial; the complexity of the model is determined from the data:

library(gam)
model <- gam(median_house_value ~
              s(mean_household_income),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
qplot(mean_household_income, value_pred, data = ca)

[Plot: predicted values from the smoothing spline fit plotted against mean household income]

We can add two smoothed variable together much as we did with the poly function:

library(ggmap)
model <- gam(median_house_value ~ s(latitude)  + s(longitude), data = ca)
ca$value_pred <- predict(model, newdata = ca)
qmplot(longitude, latitude, data = ca, color = value_pred) +
  scale_color_viridis()

[Plot: predicted values from the additive smoothing spline model, shown over a map of California]

The lo function, but not the s function, also allows for multivariate effects. So, we can smooth over interactions between latitude and longitude:

library(ggmap)
model <- gam(median_house_value ~ lo(latitude, longitude), data = ca)
ca$value_pred <- predict(model, newdata = ca)
qmplot(longitude, latitude, data = ca, color = value_pred) +
  scale_color_viridis()

[Plot: predicted values from the local polynomial smoother in latitude and longitude, shown over a map of California]

Notice that the s and lo functions are doing more than just intelligently selecting the degree of a polynomial. There are some downsides to using the gam function compared to polynomial basis expansion with the lm function. The primary downside is that the computational time can be quite long with larger datasets, particularly if we try to fit interaction terms. In fact, just the last block of code took a non-negligible amount of time to run, and we only have a few thousand rows of data and two variables. Secondly, interaction terms can become unstable in three or more dimensions. Often, a low-order polynomial expansion can be used in even 4 or 5 dimensions if there is sufficient data.
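As a sketch of that last point (this particular model is not part of the original analysis, but it uses only variables already in the dataset), a degree-2 polynomial expansion in three predictors stays small and fits quickly with lm:

# a low-order (degree 2) polynomial expansion in three variables
model <- lm(median_house_value ~
              poly(latitude, longitude, mean_household_income, degree = 2),
            subset = train_id == "train",
            data = ca)
ca$value_pred <- predict(model, newdata = ca)
sqrt(tapply((ca$median_house_value - ca$value_pred)^2,
            ca$train_id, mean))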