The dataset for today's lab uses U.S. Census data. Our task is to predict the median income of a census tract (a small area, roughly the size of a city neighborhood). The dataset has many variables, which makes it a good one for practicing penalized regression. The code below does a bit of data cleaning; I did not do this ahead of time because we will need the full dataset later in the semester. It removes a few variables that I do not want to use and restricts the data to the lower 48 states.

```
# Packages used throughout the lab (if not already loaded)
library(tidyverse)
library(glmnet)

set.seed(1)
tract <- read_csv(file.path("data", "tract_median_income.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.1, "train", "valid")) %>%
  select(-income_q1, -income_q2, -income_q3, -income_q4, -income_p95) %>%
  filter(!(state %in% c("HI", "AK")))
tract
```

```
## # A tibble: 63,591 x 64
## median_income lon lat state county cbsa_name cbsa_type total_population
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 60000 -86.5 32.5 AL Autau… Montgome… Metropol… 232
## 2 42971 -86.5 32.5 AL Autau… Montgome… Metropol… 261
## 3 43717 -86.5 32.5 AL Autau… Montgome… Metropol… 264
## 4 55814 -86.4 32.5 AL Autau… Montgome… Metropol… 499
## 5 57349 -86.4 32.5 AL Autau… Montgome… Metropol… 391
## 6 56985 -86.5 32.4 AL Autau… Montgome… Metropol… 267
## 7 37137 -86.4 32.4 AL Autau… Montgome… Metropol… 232
## 8 72917 -86.5 32.4 AL Autau… Montgome… Metropol… 248
## 9 56681 -86.5 32.5 AL Autau… Montgome… Metropol… 531
## 10 48033 -86.5 32.6 AL Autau… Montgome… Metropol… 271
## # … with 63,581 more rows, and 56 more variables: total_households <dbl>,
## # median_age <dbl>, age_00_02 <dbl>, age_03_04 <dbl>, age_05_05 <dbl>,
## # age_06_08 <dbl>, age_09_11 <dbl>, age_12_14 <dbl>, age_15_18 <dbl>,
## # same_house <dbl>, same_county <dbl>, same_state <dbl>, avg_duration <dbl>,
## # car_alone <dbl>, carpool <dbl>, public_transit <dbl>, walked <dbl>,
## # transit_other <dbl>, at_home <dbl>, white_alone <dbl>, black_alone <dbl>,
## # native_american_alone <dbl>, asian_alone <dbl>, pacific_alone <dbl>,
## # other_alone <dbl>, multiracial <dbl>, healthcare_private <dbl>,
## # healthcare_public <dbl>, healthcare_none <dbl>, housing_costs <dbl>,
## # housing_000000_000000 <dbl>, housing_010000_014999 <dbl>,
## # housing_015000_019999 <dbl>, housing_020000_024999 <dbl>,
## # housing_025000_029999 <dbl>, housing_030000_034999 <dbl>,
## # housing_035000_039999 <dbl>, housing_040000_049999 <dbl>,
## # housing_050000_059999 <dbl>, housing_060000_069999 <dbl>,
## # housing_070000_079999 <dbl>, housing_080000_089999 <dbl>,
## # housing_090000_099999 <dbl>, housing_100000_124999 <dbl>,
## # housing_125000_149999 <dbl>, housing_150000_174999 <dbl>,
## # housing_175000_199999 <dbl>, housing_200000_249999 <dbl>,
## # housing_250000_299999 <dbl>, housing_300000_399999 <dbl>,
## # housing_400000_499999 <dbl>, housing_500000_749999 <dbl>,
## # housing_750000_999999 <dbl>, housing_above_1_million <dbl>, gini <dbl>,
## # train_id <chr>
```

To speed up the model creation time, I have only put 10% of the data in the training set.
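A quick sketch (on a hypothetical toy data frame, not the tract data itself) of how the `runif()`-based split above works: each row independently draws a uniform number, so about 10% of rows land in the training set.

```r
set.seed(1)

# Hypothetical stand-in for the tract data
toy <- data.frame(id = seq_len(10000))

# Each row draws an independent uniform number; about 10% fall below 0.1
toy$train_id <- ifelse(runif(nrow(toy)) < 0.1, "train", "valid")

table(toy$train_id)
prop_train <- mean(toy$train_id == "train")
prop_train  # close to 0.10
```

Because each row is assigned independently, the split is random with respect to geography; no state or metro area is systematically held out.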

In each section of this lab, you will create one or more penalized regression models. In this first section I have written most of the code for you to make it easier to copy into the following sections (these will be good templates going forward anyway). To start, let's build a model that uses only the variable "cbsa_name", which gives the name of each metropolitan area in the country. The `cv.glmnet` function has a few extra parameters that let us see how the model is running and speed up the process. We will talk about these more next time.

```
mf <- model.frame(median_income ~ cbsa_name,
                  data = select(tract, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train", ]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
```
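Behind the scenes, `model.matrix()` expands the categorical `cbsa_name` into one 0/1 indicator column per metro area (dropping the first level as a baseline), which is why the lasso coefficients end up named after individual cities. A minimal sketch with made-up levels (the data frame and city names here are hypothetical):

```r
# Hypothetical mini-dataset with a three-level categorical predictor
df_demo <- data.frame(
  income    = c(50, 60, 70, 55),
  cbsa_name = c("Austin", "Boston", "Chicago", "Austin")
)

# One indicator column per level after the baseline, plus an intercept
X_demo <- model.matrix(~ cbsa_name, data = df_demo)
colnames(X_demo)
```

With three levels this produces an intercept plus two indicator columns; with the hundreds of metro areas in the real data, the design matrix becomes wide, which is exactly where the lasso's variable selection helps.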

The following code shows all of the non-zero coefficients for the 22nd lambda value. It will be better to only look at the non-zero terms as our datasets grow in size.

```
beta <- coef(model, s = model$lambda[22])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
colnames(beta) <- names(beta)
beta
```

```
## 27 x 1 sparse Matrix of class "dgCMatrix"
## [,1]
## (Intercept) 54650.7456
## cbsa_nameAtlanta-Sandy Springs-Roswell, GA 2407.5666
## cbsa_nameAugusta-Richmond County, GA-SC -1961.1512
## cbsa_nameBaltimore-Columbia-Towson, MD 9352.8002
## cbsa_nameBoston-Cambridge-Newton, MA-NH 15606.2891
## cbsa_nameBridgeport-Stamford-Norwalk, CT 20391.6464
## cbsa_nameCalifornia-Lexington Park, MD 414.5017
## cbsa_nameChicago-Naperville-Elgin, IL-IN-WI 3376.4368
## cbsa_nameDallas-Fort Worth-Arlington, TX 6523.4106
## cbsa_nameDenver-Aurora-Lakewood, CO 7107.4090
## cbsa_nameHartford-West Hartford-East Hartford, CT 593.1613
## cbsa_nameHouston-The Woodlands-Sugar Land, TX 2716.8657
## cbsa_nameLos Angeles-Long Beach-Anaheim, CA 7555.2000
## cbsa_nameMinneapolis-St. Paul-Bloomington, MN-WI 10223.2017
## cbsa_nameMobile, AL -1175.6250
## cbsa_nameNew York-Newark-Jersey City, NY-NJ-PA 17698.9232
## cbsa_nameOxnard-Thousand Oaks-Ventura, CA 12871.7050
## cbsa_namePhiladelphia-Camden-Wilmington, PA-NJ-DE-MD 836.2551
## cbsa_nameSalt Lake City, UT 1936.7845
## cbsa_nameSan Diego-Carlsbad, CA 7876.5283
## cbsa_nameSan Francisco-Oakland-Hayward, CA 25095.9045
## cbsa_nameSan Jose-Sunnyvale-Santa Clara, CA 37812.9999
## cbsa_nameSeattle-Tacoma-Bellevue, WA 7659.4496
## cbsa_nameShreveport-Bossier City, LA -712.3144
## cbsa_nameSummit Park, UT 13741.4287
## cbsa_nameToledo, OH -1620.1958
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV 43385.4741
```

Take note of which areas seem to have particularly large or small incomes. Does this generally fit your perception of U.S. cities? **Answer** Hopefully it does!

In the code below, select only those coefficients for the 10th lambda. Notice that only a very few variables are selected.

```
beta <- coef(model, s = model$lambda[10])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 5 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 57468.111
## cbsa_nameNew York-Newark-Jersey City, NY-NJ-PA 8291.623
## cbsa_nameSan Francisco-Oakland-Hayward, CA 8498.456
## cbsa_nameSan Jose-Sunnyvale-Santa Clara, CA 12266.431
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV 27232.839
```
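The reason some coefficients are exactly zero, rather than merely small, is that the lasso penalty soft-thresholds: any coordinate whose unpenalized estimate falls within lambda of zero is snapped to zero. A minimal base-R sketch of the operator (a simplification for intuition; glmnet's coordinate descent also standardizes the predictors and iterates to convergence):

```r
# Soft-thresholding: shrink each value toward zero by lambda,
# and set anything that crosses zero exactly to zero
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Small inputs map exactly to zero; larger ones shrink by lambda
soft_threshold(c(-3, -0.5, 0.2, 2.5), lambda = 1)
```

Larger lambda values widen the zeroed-out band, which is why the 10th lambda above keeps only four cities while the 22nd keeps twenty-six.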

In the next block, we can see how well this model performs in predicting the output variable:

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
```

```
## # A tibble: 2 x 2
## train_id rmse
## <chr> <dbl>
## 1 train 26662.
## 2 valid 26894.
```

We will have more to compare it to in the following sections.

For the next model, we are going to fit a lasso regression using all of the numeric variables in the dataset as predictors. You can do this by removing the variables "train_id", "state", "cbsa_name", and "county" and then setting the formula to "median_income ~ .".

```
mf <- model.frame(median_income ~ .,
                  data = select(tract, -train_id, -state, -cbsa_name, -county))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train", ]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
```

Show all of the coefficients for the 15th value of lambda:

```
beta <- coef(model, s = model$lambda[15])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 3 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3248.316188
## healthcare_private 385.317606
## housing_costs 2.373128
```

Next, look at the 27th lambda:

```
beta <- coef(model, s = model$lambda[27])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -28513.037804
## median_age 112.946432
## same_house 14744.406802
## same_county -2596.195943
## walked -122.667152
## white_alone 15.640305
## healthcare_private 541.255021
## housing_costs 2.726949
## housing_above_1_million 107.933532
```

And now show those for the optimal lambda (you can leave off the parameter `s` to get this).

```
beta <- coef(model)
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.994128e+04
## total_population 7.091646e-01
## median_age 1.550750e+02
## age_09_11 1.410345e+02
## same_house 1.752496e+04
## same_county -7.124264e+03
## car_alone 4.103555e+01
## walked -2.670699e+02
## at_home 2.352481e+02
## white_alone 3.841514e+01
## native_american_alone 2.778752e+01
## other_alone -3.124272e+01
## healthcare_private 5.344130e+02
## housing_costs 2.747135e+00
## housing_150000_174999 -1.460025e+01
## housing_175000_199999 -9.863579e+00
## housing_750000_999999 2.348928e+00
## housing_above_1_million 3.006684e+02
## gini -1.893985e+04
```

Take a moment to look at the results in the last three questions and understand what they tell us about the data.

Finally, compute the RMSE on the dataset and compare it to the previous model:

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
```

```
## # A tibble: 2 x 2
## train_id rmse
## <chr> <dbl>
## 1 train 10285.
## 2 valid 10487.
```

Now, we will fit the same model as in the previous part, but will add back in the cbsa_name variable.

```
mf <- model.frame(median_income ~ .,
                  data = select(tract, -train_id, -state, -county))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train", ]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
```

Look at the coefficients for the 25th value of lambda. Are any of these CBSA areas? **Answer** No.

```
beta <- coef(model, s = model$lambda[25])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -26485.437898
## median_age 101.343812
## same_house 14180.171639
## same_county -397.999568
## walked -47.611539
## white_alone 4.142824
## healthcare_private 535.999557
## housing_costs 2.707372
## housing_above_1_million 61.456585
```

Moving down, now look at the 35th value of lambda. You should see a few cities come into the model. Are these the same ones that initially popped up before?

```
beta <- coef(model, s = model$lambda[35])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
```

```
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -31189.610901
## cbsa_nameHouston-The Woodlands-Sugar Land, TX 2148.678483
## cbsa_nameLos Angeles-Long Beach-Anaheim, CA -2265.523641
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV 3111.202238
## median_age 141.304593
## same_house 17693.045985
## same_county -5526.331104
## car_alone 30.385117
## walked -236.725382
## at_home 167.709759
## white_alone 36.491057
## healthcare_private 533.218995
## housing_costs 2.753512
## housing_above_1_million 257.581914
## gini -12117.173552
```

And now compute the RMSE for the model. How does it compare to the previous two models? **Answer**: It is noticeably better than both of the last two.

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
```

```
## # A tibble: 2 x 2
## train_id rmse
## <chr> <dbl>
## 1 train 9093.
## 2 valid 9444.
```

For our last section, start by creating a lasso regression model that uses just the lon and lat variables:

```
mf <- model.frame(median_income ~ lon + lat,
                  data = tract)
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train", ]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
```

Find the RMSE; you will see that it is not as good as the previous models.

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
```

```
## # A tibble: 2 x 2
## train_id rmse
## <chr> <dbl>
## 1 train 29187.
## 2 valid 29172.
```

In the code below, I have created a model that fits a 50x50 grid over the U.S. and allows for a different coefficient to be applied to each grid cell (we will visualize this in just a moment).

```
mf <- model.frame(median_income ~ cut(lon, breaks = 50) * cut(lat, breaks = 50),
                  data = tract)
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train", ]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
```
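To see what the `cut(lon, breaks = 50) * cut(lat, breaks = 50)` formula constructs, here is a minimal sketch on made-up coordinates with a 2x2 grid (the data frame here is hypothetical); the interaction columns act as indicators for individual grid cells:

```r
set.seed(1)

# Hypothetical coordinates, using a 2x2 grid instead of 50x50
pts <- data.frame(lon = runif(100, -120, -70),
                  lat = runif(100, 25, 50))

# cut() bins each axis into intervals; the * crosses the bins,
# giving an indicator column for each grid cell beyond the baseline
X_grid <- model.matrix(~ cut(lon, breaks = 2) * cut(lat, breaks = 2),
                       data = pts)
ncol(X_grid)
```

With 2 bins per axis this gives an intercept, one main-effect dummy per axis, and one interaction column (4 total); a 50x50 grid expands the same way into thousands of columns, which is why the lasso's ability to zero most of them out matters here.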

Compute the RMSE of this model. You should see that it performs about the same as the CBSA model (here, in fact, very slightly worse on the validation set).

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
```

```
## # A tibble: 2 x 2
## train_id rmse
## <chr> <dbl>
## 1 train 26719.
## 2 valid 27082.
```

Using the last model we created (the 50x50 grid of lat and lon), predict the values of the model and draw a scatter plot with longitude on the x-axis, latitude on the y-axis, and the points colored by the predictions. I suggest using a size of 0.2 and the color scale `scale_color_distiller(palette = "Spectral")`.

```
tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  ggplot(aes(lon, lat)) +
  geom_point(aes(color = pred), size = 0.2) +
  scale_color_distiller(palette = "Spectral")
```