Let's now use a support vector machine to build a classification algorithm. Here I am using the **e1071** package in R, which is essentially a wrapper around the LIBSVM library. It is based on the algorithm we described today (i.e., directly solving the dual problem).

We will start by looking at the housing data.

```
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))  # normalize the column names
x <- na.omit(x)                # drop rows with missing values
```

And construct the same training and testing sets as in the previous class:

```
X <- x[, c(6,11:15,33:34)]            # the eight census predictors
trainFlag <- (runif(nrow(x)) < 0.66)  # roughly two-thirds training
y <- as.numeric(x$statefp == 6)       # 1 = California, 0 = Pennsylvania
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

Fit a linear SVM. The function will try to infer whether you want to do classification or regression (we have not discussed the extension of SVMs to regression); I prefer to specify it manually because the difference is easy to miss:

```
library(e1071)
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="linear")
out
```

```
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1435
```
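
As a quick illustration of why specifying the type matters: `ytrain` is numeric, so omitting the `type` argument makes `svm` default to regression. A minimal check (`outCheck` is just a throwaway name):

```
# with a numeric response and no explicit type, svm() silently
# fits eps-regression rather than a classifier
outCheck <- svm(Xtrain, ytrain)
outCheck  # prints SVM-Type: eps-regression
```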

We then predict on the test set with the `predict` function. Note that `predict` returns a factor, so `as.numeric` gives the level index (1 or 2) and we subtract one to recover the 0/1 coding of the input:

```
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
```

`## [1] 0.07451196`
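
A minimal check of that coding detail, reusing the fitted model (`p` is a temporary name):

```
# predict() returns a factor with the original "0"/"1" labels;
# as.numeric() on a factor yields the underlying level index
p <- predict(out, Xtest)
levels(p)            # "0" "1"
head(as.numeric(p))  # values in {1, 2}, hence the "- 1L"
```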

With a linear kernel, which attempts to find a hyperplane separating the classes in the feature space, the SVM performs roughly as well as the tree-based classification algorithms.
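
As an aside, for a linear kernel we can recover the fitted hyperplane explicitly from the pieces that **e1071** stores. This is a sketch: note that the weights live on the scaled inputs, since `svm` scales the predictors by default.

```
# the decision rule is sign(w . x - rho) in the (scaled) feature space;
# out$coefs holds the dual coefficients and out$SV the support vectors
w <- t(out$coefs) %*% out$SV
w        # one weight per predictor
out$rho  # the offset term
```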

If we set the *kernel* parameter to the 'radial' kernel,

```
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="radial")
out
```

```
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1389
```
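
The radial (RBF) kernel replaces the inner product with K(x, z) = exp(-gamma * ||x - z||^2). The default `gamma` is one over the number of predictors, which is why the output above shows 0.125 = 1/8 for our eight columns. A direct translation of the formula (`rbfKernel` is a hypothetical helper, just to spell it out):

```
# the radial kernel used above, written out for two numeric vectors
rbfKernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))
```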

The radial-kernel support vector machine performs noticeably better than the linear one:

```
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
```

`## [1] 0.0577399`

We can also change the cost parameter, and in this case we observe an improvement over the default cost of 1:

```
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="radial", cost=5)
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
```

`## [1] 0.05361562`
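
The value 5 here was found by hand; a small, purely illustrative grid over the cost is sketched below. Evaluating on the test set like this is only for exposition; `tune` in **e1071** does a proper cross-validated search.

```
# illustrative only: scan a few cost values and report test error
for (cst in c(0.1, 1, 5, 25)) {
  fit <- svm(Xtrain, ytrain, type="C-classification",
             kernel="radial", cost=cst)
  yhat <- as.numeric(predict(fit, Xtest)) - 1L
  cat("cost =", cst, " error =", mean(ytest != yhat), "\n")
}
```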

As before, I want to fit an additive model on the training set and then use it as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will instead use the additive model to predict the most important variable in the tree models (the median number of rooms):

`library(mgcv)`

`## Loading required package: nlme`

`## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.`

```
x.gam2 <- gam(median_rooms
                ~ s(median_house_value) + s(mean_household_income)
                + s(population) + s(total_units) + s(vacant_units)
                + s(median_household_income) + s(mean_household_size_owners)
                + s(mean_household_size_renters),
              data=x, subset=trainFlag)
```
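
Before feeding these predictions into the tree models, it is worth a quick sanity check of the fit; for example:

```
# effective degrees of freedom and significance of each smooth term
summary(x.gam2)
```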

We then reconstruct the test and training matrices:

```
x$gamPred <- predict(x.gam2, x)
X <- x[, c(6,11:15,33:35)]
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

And re-run the random forest estimator:

`library(randomForest)`

`## randomForest 4.6-12`

`## Type rfNews() to see new features/changes/bug fixes.`

```
rfObj <- randomForest(Xtrain, factor(ytrain),
                      Xtest, factor(ytest),
                      do.trace=FALSE, keep.forest=TRUE,
                      ntree=500L)
rfYhat <- predict(rfObj, Xtest)
```

As well as the support vector machines:

```
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="linear", cost=100)
svmLinearYhat <- as.numeric(predict(out, Xtest)) - 1L
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="radial")
svmRadialYhat <- as.numeric(predict(out, Xtest)) - 1L
```

And compare the results:

`mean(ytest != rfYhat)`

`## [1] 0.04756668`

`mean(ytest != svmLinearYhat)`

`## [1] 0.04839153`

`mean(ytest != svmRadialYhat)`

`## [1] 0.03574374`

And so, when we use a non-linear kernel, the support vector machine performs noticeably better than the random forest in this example. Note that we could likely improve the overall fit slightly further by tuning the cost and gamma parameters.

How different are the predicted classes from the support vector machine compared to the random forest?

`table(svmRadialYhat, rfYhat)`

```
## rfYhat
## svmRadialYhat 0 1
## 0 967 66
## 1 25 2579
```

In order to do some form of stacking, we need to compute both estimators on a probability scale. This can be done as follows:

```
rfYhatProb <- predict(rfObj, Xtest, type="prob")[,2]
out <- svm(Xtrain, ytrain, type="C-classification",
           kernel="radial", probability=TRUE)
svmRadialYhatProb <- attributes(predict(out, Xtest, probability=TRUE))$probabilities[,1]
cor(rfYhatProb, svmRadialYhatProb)
```

`## [1] 0.9686548`
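
One general caveat with the extraction above: the columns of `$probabilities` are ordered by the class labels of the fitted model, so indexing by name is safer than by position. A sketch (`probs` is a temporary name, and this assumes the positive class is labeled "1"):

```
# check which probability column corresponds to which class,
# then index by label rather than position, e.g. probs[, "1"]
probs <- attributes(predict(out, Xtest, probability=TRUE))$probabilities
colnames(probs)
```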

Notice that, again, the two estimators are quite similar but not identical. If we blend them together, we get a slight improvement over the SVM alone:

```
newYhat <- as.numeric(rfYhatProb*0.4 + svmRadialYhatProb*0.6 > 0.5)
mean(newYhat != ytest)
```

`## [1] 0.03464394`
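
The 0.4/0.6 split was picked by hand. An illustrative scan over the mixing weight is sketched below; a careful analysis would choose this weight on a held-out validation set rather than the test set:

```
# scan blending weights between the two probability estimates
alpha <- seq(0, 1, by=0.1)
err <- sapply(alpha, function(a)
  mean(as.numeric(a*rfYhatProb + (1-a)*svmRadialYhatProb > 0.5) != ytest))
round(rbind(alpha, err), 4)
```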

To gain an understanding and appreciation for where we are still making errors, we can construct the *confusion matrix* of the predictions from the radial-kernel SVM:

`table(svmRadialYhat, ytest)`

```
## ytest
## svmRadialYhat 0 1
## 0 978 55
## 1 75 2529
```

It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?

```
these <- which(svmRadialYhat != ytest)
par(mar=c(0,0,0,0))
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
     axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/", alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
       pch=19, cex=0.4, col="orange")
box()
```

Switching gears for a moment, let's consider again the problem of predicting the median house price for just the California data.

Say that we wanted to utilize the county codes in a predictive model (this is quite relevant to problem set 3). How might we do this, particularly with support vector machines, which do not natively handle categorical variables? One way is to build a hierarchical model, which first ranks the counties in some fashion and then uses that ranking as a predictor variable.

```
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
```
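
As an aside, the most common alternative is one-hot (dummy) encoding, sketched below; with California's 58 counties this adds many sparse indicator columns, which is part of what makes the ranking approach attractive (`countyDummies` is just an illustrative name):

```
# one indicator column per county; not used in the analysis below
countyDummies <- model.matrix(~ factor(countyfp) - 1, data=ca)
dim(countyDummies)
```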

This hierarchical model does not need to be overly complex. For example, we can simply average the tract-level median house values within each county:

```
tab <- tapply(ca$median_house_value[trainFlag],
              ca$countyfp[trainFlag], mean)
tab
```

```
## 1 3 5 7 9 11 13 15
## 515177.3 395600.0 323742.9 256211.4 366466.7 213350.0 454139.4 207266.7
## 17 19 21 23 25 27 29 31
## 405636.0 234236.5 258660.0 303747.4 160538.1 289350.0 188555.9 252017.6
## 33 35 37 39 41 43 45 47
## 216500.0 211700.0 452806.0 237864.7 757447.6 264250.0 401793.3 203094.1
## 49 51 53 55 57 59 61 63
## 151866.7 419533.3 414477.6 510805.3 410490.9 522405.8 403832.7 217875.0
## 65 67 69 71 73 75 77 79
## 273487.7 282013.2 451857.1 264718.3 434141.0 701880.2 227706.0 527571.0
## 81 83 85 87 89 91 93 95
## 663035.7 503654.5 589356.1 591994.3 257251.9 304700.0 197236.4 333773.3
## 97 99 101 103 105 107 109 111
## 499596.6 219882.0 247230.8 194733.3 270175.0 184054.5 302162.5 496460.7
## 113 115
## 433689.3 181300.0
```

And attach this to the data set:

```
index <- match(ca$countyfp, names(tab))  # look up each tract's county mean
ca$countyMHP <- tab[index]
X <- ca[, c(6,11:15,33:35)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

Now we can fit a random forest with this meta-predictor variable:

```
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
                      do.trace=FALSE, keep.forest=TRUE,
                      ntree=500)
rfObj
```

```
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.07314528
## % Var explained: 73.11
## Test set MSE: 0.08
## % Var explained: 72.72
```

`importance(rfObj)`

```
## IncNodePurity
## population 43.21423
## total_units 38.21183
## vacant_units 43.61819
## median_rooms 87.65921
## mean_household_size_owners 59.91909
## mean_household_size_renters 83.30742
## median_household_income 191.94969
## mean_household_income 354.59927
## countyMHP 391.37913
```

This is significantly better than the same model without the county summary statistic, which, incidentally, is now the most important variable in the random forest.