Last week, we talked about using decision trees for regression and classification problems. Here, we are going to actual apply these techniques to the California and Pennsylvania housing datasets.

As a first task, let us again take just the California data and try to predict the logarithm of the median house value in each census tract.

```
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
```

In order to fit a single regression tree, we use the R package **tree**. As a starting point, I will just use longitude and latitude.

```
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude, data = ca)
plot(tf)
text(tf, cex=0.75)
```

It is easy to incorporate more variables in the model. How much do these change the tree that is learned from the data?

```
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude +
median_household_income + median_rooms,
data = ca)
plot(tf)
text(tf, cex=0.75)
```

Which variables might we want to incorporate if we want a highly predictive model? I am going to start by using these:

`names(ca)[c(6,11:15,33:34)]`

```
## [1] "population" "total_units"
## [3] "vacant_units" "median_rooms"
## [5] "mean_household_size_owners" "mean_household_size_renters"
## [7] "median_household_income" "mean_household_income"
```

I find it easier to use matrices directly rather than the formula interface in the **randomForest** package (it is much slower and makes subsequent predictions on new data more difficult). Here I define a matrix of predictors for both the training and testing sets, as well as the response for both training and testing.

```
X <- ca[, c(6,11:15,33:34)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

I will fit a small random forest estimator to this data with only 15 trees. Notice that the function allows me to directly give the test dataset to the function; if I also turn on the *do.trace* option, this shows how well we are predicting on the test set as the model learns more trees. Also notice that if I define a test set beforehand, we need to tell R to keep the forest in the output object. Otherwise the forest gets thrown away, making predictions on new datasets or calculating variable importance impossible (I have no idea why this they thought this would be a good default).

`library(randomForest)`

`## randomForest 4.6-12`

`## Type rfNews() to see new features/changes/bug fixes.`

```
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=TRUE, keep.forest=TRUE,
ntree=15L)
```

```
## | Out-of-bag | Test set |
## Tree | MSE %Var(y) | MSE %Var(y) |
## 1 | 0.1904 70.00 | 0.2044 70.87 |
## 2 | 0.1806 66.40 | 0.1571 54.47 |
## 3 | 0.1762 64.79 | 0.1445 50.11 |
## 4 | 0.1694 62.29 | 0.1366 47.39 |
## 5 | 0.1663 61.14 | 0.1285 44.55 |
## 6 | 0.1557 57.26 | 0.1241 43.05 |
## 7 | 0.1485 54.58 | 0.1214 42.08 |
## 8 | 0.1447 53.21 | 0.1194 41.41 |
## 9 | 0.1426 52.42 | 0.1177 40.80 |
## 10 | 0.139 51.09 | 0.1164 40.37 |
## 11 | 0.1335 49.08 | 0.1154 40.02 |
## 12 | 0.1324 48.67 | 0.1156 40.09 |
## 13 | 0.1298 47.74 | 0.1158 40.14 |
## 14 | 0.1274 46.82 | 0.1151 39.92 |
## 15 | 0.125 45.95 | 0.1147 39.77 |
```

`rfObj`

```
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 15L, do.trace = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 15
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1249941
## % Var explained: 54.05
## Test set MSE: 0.11
## % Var explained: 60.24
```

Notice that the randomForest estimator automatically calculates the training and testing MSE. Note: I would argue that we are treating **Xtest** and **ytest** as a validation set, but named them as such to following terminology in the *randomForest* function. Now, let’s turn off the *do.trace* option and run the estimator with 500 trees.

```
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfObj
```

```
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1040941
## % Var explained: 61.73
## Test set MSE: 0.11
## % Var explained: 61.73
```

```
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
```

`## [1] 0.1103721`

Here we have explicitly calculated the MSE as well. How much better does the addition of 485 more trees do compared to the simpler model? As mentioned last time, we can use a random forest to calculate the importance of each of the variables in the model, as measured by how often each variable is used for splitting on and how much it decreases the MSE. This is done with the *importance* function (remember, this only works if you set *keep.forest* to TRUE).

`importance(rfObj)`

```
## IncNodePurity
## population 70.42507
## total_units 65.93293
## vacant_units 93.05627
## median_rooms 157.68088
## mean_household_size_owners 90.45918
## mean_household_size_renters 119.44500
## median_household_income 270.05590
## mean_household_income 410.49938
```

The scale of the output is not terribly important, and we are only going to focus on the relative sizes of these importance scores. Household income seems to be the most important, followed by the median number of rooms. Do these seem reasonable to you?

While the exact structure of one tree in a random forest is difficult to make too much sense of in general, I think it is useful to understand how these trees are being stored as a data structure in R. Let’s look at the first few rows of the first tree object of the random forest:

`head(getTree(rfObj,k=1),15)`

```
## left daughter right daughter split var split point status prediction
## 1 2 3 4 5.650 -3 12.81532
## 2 4 5 5 3.145 -3 12.73616
## 3 6 7 3 76.500 -3 13.02461
## 4 8 9 2 1954.000 -3 12.82533
## 5 10 11 7 46570.000 -3 12.60980
## 6 12 13 8 95428.500 -3 13.13880
## 7 14 15 7 77486.000 -3 12.89324
## 8 16 17 1 1988.500 -3 12.76969
## 9 18 19 8 69669.000 -3 12.90005
## 10 20 21 7 35629.000 -3 12.47093
## 11 22 23 8 76361.500 -3 12.81060
## 12 24 25 5 3.625 -3 12.81310
## 13 26 27 7 106924.000 -3 13.32329
## 14 28 29 5 2.865 -3 12.58295
## 15 30 31 5 2.875 -3 13.14507
```

Why do you think the R implementation is using a matrix to store this information rather than a hierarchical list that better represents the actual structure of the tree?

Finally, note that we can easily increase the number of trees in the random forest even after it has already been constructed. Here we add 10 more trees with the *grow* function:

```
rfObj <- grow(rfObj, how.many=10)
rfObj
```

```
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 510
## No. of variables tried at each split: 2
```

Let’s not return to the additive models we studied two weeks ago. I will fit an additive model using the same exact variables we had used in the random forest on the training set:

`library(mgcv)`

`## Loading required package: nlme`

`## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.`

```
ca.gam2 <- gam(log(median_house_value)
~ s(median_household_income) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_rooms) + s(mean_household_size_owners)
+ s(mean_household_size_renters)
, data=ca, subset=trainFlag)
```

Now, we can find predictions from the generalized additive model for both the training set and testing set. I will then reconstruct the **X** matrices to use this prediction itself as a predictor variable in the model.

```
ca$gamPred <- predict(ca.gam2, ca)
X <- ca[, c(6,11:15,33:35)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

Now we can run another random forest on this new dataset. Notice that the MSE decreases slightly using this new variable.

```
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE)
rfObj
```

```
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.09915329
## % Var explained: 63.55
## Test set MSE: 0.11
## % Var explained: 62.7
```

```
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
```

`## [1] 0.1075728`

How important is this new variable in the random forest model?

`importance(rfObj)`

```
## IncNodePurity
## population 50.97247
## total_units 47.14099
## vacant_units 53.38192
## median_rooms 76.87409
## mean_household_size_owners 60.17520
## mean_household_size_renters 74.61663
## median_household_income 153.18022
## mean_household_income 247.40573
## gamPred 524.55112
```

Now, I want to try gradient boosting on this same dataset. For this I will use the excellent **gbm** package in R. There are two functions for fitting gradient boosted trees (or machines, as the package calls them): *gbm* and *gbm.fit*. The first takes a formula object and the second take raw matrices. As said before, I prefer to deal with the raw matrix version as it is faster, but one needs to be careful that the training and testing matrices are constructed exactly the same. We will start with using 100 trees and a relatively high shrinkage factor of 0.1:

`library(gbm)`

`## Loading required package: survival`

`## Loading required package: lattice`

`## Loading required package: splines`

`## Loading required package: parallel`

`## Loaded gbm 2.1.1`

```
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.1)
```

```
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2497 nan 0.1000 0.0220
## 2 0.2315 nan 0.1000 0.0177
## 3 0.2157 nan 0.1000 0.0157
## 4 0.2024 nan 0.1000 0.0128
## 5 0.1910 nan 0.1000 0.0113
## 6 0.1809 nan 0.1000 0.0100
## 7 0.1723 nan 0.1000 0.0087
## 8 0.1647 nan 0.1000 0.0073
## 9 0.1581 nan 0.1000 0.0059
## 10 0.1520 nan 0.1000 0.0060
## 20 0.1175 nan 0.1000 0.0016
## 40 0.1007 nan 0.1000 0.0003
## 60 0.0976 nan 0.1000 0.0000
## 80 0.0968 nan 0.1000 -0.0001
## 100 0.0962 nan 0.1000 -0.0001
```

Notice that the variable importance scores look a bit more lopsided than in the random forest case. Why might this be the case?

`summary(gbmObj,plotit=FALSE)`

```
## var rel.inf
## gamPred gamPred 95.8896590
## mean_household_income mean_household_income 3.0322903
## mean_household_size_renters mean_household_size_renters 0.3405273
## median_household_income median_household_income 0.2254245
## median_rooms median_rooms 0.2145368
## mean_household_size_owners mean_household_size_owners 0.1490498
## population population 0.1485122
## total_units total_units 0.0000000
## vacant_units vacant_units 0.0000000
```

If we turn down shrinkage notice that the training deviance (a fancy way of saying the MSE) decreases much slower, and after 100 trees the variable importance scores are even more lopsided.

```
library(gbm)
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.01)
```

```
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2697 nan 0.0100 0.0023
## 2 0.2674 nan 0.0100 0.0022
## 3 0.2653 nan 0.0100 0.0022
## 4 0.2631 nan 0.0100 0.0022
## 5 0.2610 nan 0.0100 0.0021
## 6 0.2589 nan 0.0100 0.0021
## 7 0.2568 nan 0.0100 0.0020
## 8 0.2548 nan 0.0100 0.0020
## 9 0.2528 nan 0.0100 0.0020
## 10 0.2508 nan 0.0100 0.0019
## 20 0.2331 nan 0.0100 0.0016
## 40 0.2049 nan 0.0100 0.0012
## 60 0.1837 nan 0.0100 0.0009
## 80 0.1672 nan 0.0100 0.0007
## 100 0.1543 nan 0.0100 0.0006
```

`summary(gbmObj,plotit=FALSE)`

```
## var rel.inf
## gamPred gamPred 100
## population population 0
## total_units total_units 0
## vacant_units vacant_units 0
## median_rooms median_rooms 0
## mean_household_size_owners mean_household_size_owners 0
## mean_household_size_renters mean_household_size_renters 0
## median_household_income median_household_income 0
## mean_household_income mean_household_income 0
```

Now I will let it run for a while with a modest shrinkage penalty. We will then look at a plot of the MSE on the test set as a function of the number of trees.

```
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.03, verbose=FALSE)
ntree <- seq(100,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
```

If we decrease the shrinkage penalty from 0.03 to 0.005, notice what happens to the MSE on the test set:

```
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
```

Notice that the model does slightly worst after 2500 trees, but takes slightly longer to reach the optimal value. It also does not suffer as much when the number of trees is increased past this point. What if we try a still lower amount of shrinkage?

```
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.001, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
```

**As a general rule of thumb, you want to slowly decrease the shrinkage and produce curves like this, generally until the point where the shrinkage penalty is so small that it never really decays on the backend.** This is in bold because it is one of those important secrets of applied machine learning that is known well by a set of practitioners but not actually written down anywhere (prominent, at least).

For an ultimately model, we might choose to use this final model this MSE and pick only 8500 trees. Do you see now why we are treating the test set more like a validation set?

```
gbmYhat <- predict(gbmObj, Xtest, n.trees=8500)
mean((gbmYhat - ytest)^2)
```

`## [1] 0.1111789`

Now, notice that we can improve both the random forest and gradient boosted models by averaging there predictions. As the random forest is slightly more predictive, I will weight it a bit higher:

`mean((rfYhat - ytest)^2)`

`## [1] 0.1075728`

`mean((gbmYhat - ytest)^2)`

`## [1] 0.1111789`

`mean((rfYhat*0.8 + gbmYhat*0.2 - ytest)^2)`

`## [1] 0.107298`

Taking this a step farther, we can even weight together the random forest, gradient boosted tree, and additive model. This produces a slightly better model still.

```
gamYhat <- as.numeric(ca$gamPred[!trainFlag])
mean((rfYhat*0.8 + gbmYhat*0.1 + gamYhat*0.1 - ytest)^2)
```

`## [1] 0.1070845`

The process of blending together models is called *stacking* (amongst many other names), and is an important technique in producing state of the art predictive models. It is also possible to `learn’ the weights on a separate part of the validation set. In a way, this is what we did in using the additive model as part of the random forest and gradient boosted tree models.

I would like to note that while for this small example it seems that gradient boosting is a lot more work with little reward, this is not in general the case. Typically it will be able to significantly improve on the results of the random forest.

It may have seemed like I should use a separate chunk of data to fit the additive model as opposed to the random forest and gradient boosted trees. It is true that given a large enough dataset, this would be the ideal. However, I am not particularly worried about this effecting the output. Why? Additive models have a relatively low complexity when we only have 8 variables, and will rarely overfit to the training data.

In general, there is no big mysterious difference between classification and regression with decision trees. Some of the details are important to understand in terms of how these methods are employed in practice though, so I think it is useful to actually look at an example. I will take the same dataset as in the previous section:

```
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
```

But now use the entire dataset, setting the class that we want to predict equal to which state the census tract is located in:

```
X <- x[, c(6,11:15,33:34)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

We can run the random forest model, explicitly coding the y variables as factors to tell the function to run classification:

```
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
```

To evaluate the model, I will use the misclassification rate on the test set:

`mean(rfYhat != ytest)`

`## [1] 0.06598845`

Now, as before, let’s look at the variable importance scores for this model. Do these surprise you at all?

`importance(rfObj)`

```
## MeanDecreaseGini
## population 161.6781
## total_units 133.4970
## vacant_units 149.5611
## median_rooms 847.5819
## mean_household_size_owners 379.0625
## mean_household_size_renters 549.5154
## median_household_income 314.5436
## mean_household_income 373.0121
```

We can also fit a gradient boosted tree to the classification problem. In order to tell the *gbm.fit* function that we want classification, I will set the *distribution* parameter to the ‘bernoulli’ loss. How do the important variable differ from the random forest? How does the mis-classification rate compare to the random forest estimator?

```
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
```

```
## var rel.inf
## median_rooms median_rooms 37.7247915
## mean_household_size_renters mean_household_size_renters 27.0391940
## mean_household_income mean_household_income 17.6024644
## mean_household_size_owners mean_household_size_owners 10.0502211
## median_household_income median_household_income 5.7804208
## population population 0.8391667
## vacant_units vacant_units 0.5404817
## total_units total_units 0.4232599
```

`mean(gbmYhat != ytest)`

`## [1] 0.06406379`

As before, I want to fit an additive model to the training set and then use this as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will instead use the additive model to predictive the most important variable in the tree models (median number of rooms):

```
library(mgcv)
x.gam2 <- gam(median_rooms
~ s(median_house_value) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_household_income) + s(mean_household_size_owners)
+ s(mean_household_size_renters),
data=x, subset=trainFlag)
```

I will now find the predictions of these variables and put these predictions in the predictor matrices **Xtrain** and **Xtest**.

```
x$gamPred <- predict(x.gam2, x)
X <- x[, c(6,11:15,33:35)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

This gives an impressive improvement on the misclassification rate for the random forest model:

```
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
mean(rfYhat != ytest)
```

`## [1] 0.04337085`

To gain an understanding and appreciation for where we are still making errors, we can construct the *confusion matrix* of the predictions from the random forest estimator.

`table(rfYhat, ytest)`

```
## ytest
## rfYhat 0
```