Median Tract Income

The dataset for today’s lab uses U.S. Census data. Our prediction task is to predict the median income of a census tract (a small region, roughly the size of a neighborhood in a city). There are a lot of variables in this dataset, so it is a good one for practicing penalized regression. The code below does a bit of data cleaning; I did not do this ahead of time because we will need the full dataset later in the semester. It removes a few variables that I do not want to use and restricts the data to the lower 48 states.

set.seed(1)

tract <- read_csv(file.path("data", "tract_median_income.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.1, "train", "valid")) %>%
  select(-income_q1, -income_q2, -income_q3, -income_q4, -income_p95) %>%
  filter(!(state %in% c("HI", "AK")))

tract
## # A tibble: 63,591 x 64
##    median_income   lon   lat state county cbsa_name cbsa_type total_population
##            <dbl> <dbl> <dbl> <chr> <chr>  <chr>     <chr>                <dbl>
##  1         60000 -86.5  32.5 AL    Autau… Montgome… Metropol…              232
##  2         42971 -86.5  32.5 AL    Autau… Montgome… Metropol…              261
##  3         43717 -86.5  32.5 AL    Autau… Montgome… Metropol…              264
##  4         55814 -86.4  32.5 AL    Autau… Montgome… Metropol…              499
##  5         57349 -86.4  32.5 AL    Autau… Montgome… Metropol…              391
##  6         56985 -86.5  32.4 AL    Autau… Montgome… Metropol…              267
##  7         37137 -86.4  32.4 AL    Autau… Montgome… Metropol…              232
##  8         72917 -86.5  32.4 AL    Autau… Montgome… Metropol…              248
##  9         56681 -86.5  32.5 AL    Autau… Montgome… Metropol…              531
## 10         48033 -86.5  32.6 AL    Autau… Montgome… Metropol…              271
## # … with 63,581 more rows, and 56 more variables: total_households <dbl>,
## #   median_age <dbl>, age_00_02 <dbl>, age_03_04 <dbl>, age_05_05 <dbl>,
## #   age_06_08 <dbl>, age_09_11 <dbl>, age_12_14 <dbl>, age_15_18 <dbl>,
## #   same_house <dbl>, same_county <dbl>, same_state <dbl>, avg_duration <dbl>,
## #   car_alone <dbl>, carpool <dbl>, public_transit <dbl>, walked <dbl>,
## #   transit_other <dbl>, at_home <dbl>, white_alone <dbl>, black_alone <dbl>,
## #   native_american_alone <dbl>, asian_alone <dbl>, pacific_alone <dbl>,
## #   other_alone <dbl>, multiracial <dbl>, healthcare_private <dbl>,
## #   healthcare_public <dbl>, healthcare_none <dbl>, housing_costs <dbl>,
## #   housing_000000_000000 <dbl>, housing_010000_014999 <dbl>,
## #   housing_015000_019999 <dbl>, housing_020000_024999 <dbl>,
## #   housing_025000_029999 <dbl>, housing_030000_034999 <dbl>,
## #   housing_035000_039999 <dbl>, housing_040000_049999 <dbl>,
## #   housing_050000_059999 <dbl>, housing_060000_069999 <dbl>,
## #   housing_070000_079999 <dbl>, housing_080000_089999 <dbl>,
## #   housing_090000_099999 <dbl>, housing_100000_124999 <dbl>,
## #   housing_125000_149999 <dbl>, housing_150000_174999 <dbl>,
## #   housing_175000_199999 <dbl>, housing_200000_249999 <dbl>,
## #   housing_250000_299999 <dbl>, housing_300000_399999 <dbl>,
## #   housing_400000_499999 <dbl>, housing_500000_749999 <dbl>,
## #   housing_750000_999999 <dbl>, housing_above_1_million <dbl>, gini <dbl>,
## #   train_id <chr>

To speed up model fitting, I have put only 10% of the data in the training set.
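
If you want to verify the split, counting the rows in each group is a one-liner (a quick optional check):

# confirm the split: roughly 10% of tracts should be in "train"
tract %>%
  count(train_id)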

Example Analysis: CBSA

In each section of this lab, you will create one or more penalized regression models. In this first section I have written most of the code for you, so that it is easy to copy into the following sections (these will also be good templates going forward). To start, let’s build a model that uses only the variable “cbsa_name”, which gives a name to each metropolitan area in the country. The cv.glmnet function call has a few extra parameters that let us watch the model as it runs and speed up the process; we will talk about these more next time.

# build the model frame, response vector, and model matrix from the formula
mf <- model.frame(median_income ~ cbsa_name,
                  data = select(tract, -train_id))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

# restrict to the training rows
X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

# cross-validated lasso (alpha = 1); 3 folds keeps the fitting fast
model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)
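
Before we start pulling out coefficients at specific penalty values, it helps to know that cv.glmnet fits a whole sequence of lambda values, stored from largest (most shrinkage) to smallest. A quick way to inspect the sequence and the cross-validated choice:

# the fitted sequence of penalty values; indexing into it (e.g., lambda[22])
# is how we will pick out specific points on the path below
length(model$lambda)
head(model$lambda)

# the lambda that minimizes the cross-validated error
model$lambda.min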

The following code shows all of the non-zero coefficients for the 22nd lambda value. As our datasets grow in size, it will be better to look only at the non-zero terms.

beta <- coef(model, s = model$lambda[22])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
colnames(beta) <- names(beta)
beta
## 27 x 1 sparse Matrix of class "dgCMatrix"
##                                                       [,1]      
## (Intercept)                                           54650.7456
## cbsa_nameAtlanta-Sandy Springs-Roswell, GA             2407.5666
## cbsa_nameAugusta-Richmond County, GA-SC               -1961.1512
## cbsa_nameBaltimore-Columbia-Towson, MD                 9352.8002
## cbsa_nameBoston-Cambridge-Newton, MA-NH               15606.2891
## cbsa_nameBridgeport-Stamford-Norwalk, CT              20391.6464
## cbsa_nameCalifornia-Lexington Park, MD                  414.5017
## cbsa_nameChicago-Naperville-Elgin, IL-IN-WI            3376.4368
## cbsa_nameDallas-Fort Worth-Arlington, TX               6523.4106
## cbsa_nameDenver-Aurora-Lakewood, CO                    7107.4090
## cbsa_nameHartford-West Hartford-East Hartford, CT       593.1613
## cbsa_nameHouston-The Woodlands-Sugar Land, TX          2716.8657
## cbsa_nameLos Angeles-Long Beach-Anaheim, CA            7555.2000
## cbsa_nameMinneapolis-St. Paul-Bloomington, MN-WI      10223.2017
## cbsa_nameMobile, AL                                   -1175.6250
## cbsa_nameNew York-Newark-Jersey City, NY-NJ-PA        17698.9232
## cbsa_nameOxnard-Thousand Oaks-Ventura, CA             12871.7050
## cbsa_namePhiladelphia-Camden-Wilmington, PA-NJ-DE-MD    836.2551
## cbsa_nameSalt Lake City, UT                            1936.7845
## cbsa_nameSan Diego-Carlsbad, CA                        7876.5283
## cbsa_nameSan Francisco-Oakland-Hayward, CA            25095.9045
## cbsa_nameSan Jose-Sunnyvale-Santa Clara, CA           37812.9999
## cbsa_nameSeattle-Tacoma-Bellevue, WA                   7659.4496
## cbsa_nameShreveport-Bossier City, LA                   -712.3144
## cbsa_nameSummit Park, UT                              13741.4287
## cbsa_nameToledo, OH                                   -1620.1958
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV 43385.4741

Take note of which areas seem to have particularly large or small incomes. Does this generally fit your perception of U.S. cities? Answer: Hopefully it does!
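
If you want help scanning for the extremes, one option is to turn the sparse coefficient matrix into a data frame and sort it (a small sketch using the beta object from above):

# sort the non-zero coefficients from largest to smallest effect
beta_df <- data.frame(term = rownames(beta), coef = as.numeric(beta[, 1]))
beta_df[order(beta_df$coef, decreasing = TRUE), ]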

In the code below, select only those coefficients for the 10th lambda. Notice that only a very few variables are selected.

beta <- coef(model, s = model$lambda[10])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                                                               1
## (Intercept)                                           57468.111
## cbsa_nameNew York-Newark-Jersey City, NY-NJ-PA         8291.623
## cbsa_nameSan Francisco-Oakland-Hayward, CA             8498.456
## cbsa_nameSan Jose-Sunnyvale-Santa Clara, CA           12266.431
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV 27232.839

In the next block, we can see how well this model performs in predicting the output variable:

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
## # A tibble: 2 x 2
##   train_id   rmse
##   <chr>     <dbl>
## 1 train    26662.
## 2 valid    26894.

We will have more to compare it to in the following sections.
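
Since we will run this exact RMSE summary after every model, you could wrap it in a small helper function (an optional sketch; the rest of the lab keeps the explicit pipeline so each step stays visible):

# RMSE by train/valid split for a fitted cv.glmnet model
rmse_by_split <- function(model, X, data) {
  data %>%
    mutate(pred = predict(model, newx = X)) %>%
    group_by(train_id) %>%
    summarize(rmse = sqrt(mean((median_income - pred)^2)))
}

# usage: rmse_by_split(model, X, tract)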

Model 2: Numeric Variables

For the next model, we are going to fit a lasso regression using all of the numeric variables in the dataset as predictors. You can do this by removing the variables “train_id”, “state”, “cbsa_name”, and “county” and then setting the formula to “median_income ~ .”.

mf <- model.frame(median_income ~ .,
                  data = select(tract, -train_id, -state, -cbsa_name, -county))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)

Show all of the coefficients for the 15th value of lambda:

beta <- coef(model, s = model$lambda[15])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)        3248.316188
## healthcare_private  385.317606
## housing_costs         2.373128

Next, look at the 27th lambda:

beta <- coef(model, s = model$lambda[27])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)             -28513.037804
## median_age                 112.946432
## same_house               14744.406802
## same_county              -2596.195943
## walked                    -122.667152
## white_alone                 15.640305
## healthcare_private         541.255021
## housing_costs                2.726949
## housing_above_1_million    107.933532

And now show those for the lambda selected by cross-validation (you can leave off the parameter s to get this).

beta <- coef(model)
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)             -2.994128e+04
## total_population         7.091646e-01
## median_age               1.550750e+02
## age_09_11                1.410345e+02
## same_house               1.752496e+04
## same_county             -7.124264e+03
## car_alone                4.103555e+01
## walked                  -2.670699e+02
## at_home                  2.352481e+02
## white_alone              3.841514e+01
## native_american_alone    2.778752e+01
## other_alone             -3.124272e+01
## healthcare_private       5.344130e+02
## housing_costs            2.747135e+00
## housing_150000_174999   -1.460025e+01
## housing_175000_199999   -9.863579e+00
## housing_750000_999999    2.348928e+00
## housing_above_1_million  3.006684e+02
## gini                    -1.893985e+04

Take a moment to look at the results in the last three questions and understand what they tell us about the data.
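
Keep in mind that coef(model) without s uses lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum. To compare it against the error-minimizing choice, a quick sketch:

# the two stored lambda values: the CV-error minimizer and the 1-SE rule
c(lambda_min = model$lambda.min, lambda_1se = model$lambda.1se)

# coefficients at the CV-minimizing lambda are typically less sparse
beta_min <- coef(model, s = "lambda.min")
sum(beta_min != 0)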

Finally, compute the RMSE on the training and validation sets and compare it to the previous model:

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
## # A tibble: 2 x 2
##   train_id   rmse
##   <chr>     <dbl>
## 1 train    10285.
## 2 valid    10487.

Model 3: Numeric Variables with CBSA Name

Now, we will fit the same model as in the previous part, but will add back in the cbsa_name variable.

mf <- model.frame(median_income ~ .,
                  data = select(tract, -train_id, -state, -county))
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)

Look at the coefficients for the 25th value of lambda. Are any of these CBSA areas? Answer: No.

beta <- coef(model, s = model$lambda[25])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)             -26485.437898
## median_age                 101.343812
## same_house               14180.171639
## same_county               -397.999568
## walked                     -47.611539
## white_alone                  4.142824
## healthcare_private         535.999557
## housing_costs                2.707372
## housing_above_1_million     61.456585

Moving down, now look at the 35th value of lambda. You should see a few cities come into the model. Are these the same ones that initially popped up before?

beta <- coef(model, s = model$lambda[35])
beta <- beta[apply(beta != 0, 1, any),,drop=FALSE]
beta
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                                                   1
## (Intercept)                                           -31189.610901
## cbsa_nameHouston-The Woodlands-Sugar Land, TX           2148.678483
## cbsa_nameLos Angeles-Long Beach-Anaheim, CA            -2265.523641
## cbsa_nameWashington-Arlington-Alexandria, DC-VA-MD-WV   3111.202238
## median_age                                               141.304593
## same_house                                             17693.045985
## same_county                                            -5526.331104
## car_alone                                                 30.385117
## walked                                                  -236.725382
## at_home                                                  167.709759
## white_alone                                               36.491057
## healthcare_private                                       533.218995
## housing_costs                                              2.753512
## housing_above_1_million                                  257.581914
## gini                                                  -12117.173552

And now compute the RMSE for the model. How does it compare to the previous two models? Answer: It is noticeably better than both of the last two.

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
## # A tibble: 2 x 2
##   train_id  rmse
##   <chr>    <dbl>
## 1 train    9093.
## 2 valid    9444.

Model 4: Latitude and Longitude

For our last section, start by creating a lasso regression model that uses just the lon and lat variables:

mf <- model.frame(median_income ~ lon + lat,
                  data = tract)
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)

Find the RMSE; you will see that it is not as good as the previous models.

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
## # A tibble: 2 x 2
##   train_id   rmse
##   <chr>     <dbl>
## 1 train    29187.
## 2 valid    29172.

In the code below, I have created a model that fits a 50x50 grid over the U.S. and allows a different coefficient for each grid cell (we will visualize this in just a moment).
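
To get a feel for what cut is doing here, a small sketch that bins longitude into just five intervals (five rather than fifty so the table fits on screen):

# cut() converts a numeric variable into a factor of intervals; crossing two
# cut() factors in the formula creates one indicator column per grid cell
table(cut(tract$lon, breaks = 5))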

mf <- model.frame(median_income ~ cut(lon, breaks = 50) * cut(lat, breaks = 50),
                  data = tract)
mt <- attr(mf, "terms")
y <- model.response(mf)
X <- model.matrix(mt, mf)

X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 3, trace.it = FALSE)

Compute the RMSE of this model. You should see that it performs about the same as the CBSA model (in fact, its validation RMSE here comes out slightly higher).

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  group_by(train_id) %>%
  summarize(rmse = sqrt(mean((median_income - pred)^2)))
## # A tibble: 2 x 2
##   train_id   rmse
##   <chr>     <dbl>
## 1 train    26719.
## 2 valid    27082.

Plots!

Using the last model we created (the 50x50 grid of lat and lon), compute the model’s predictions and draw a scatter plot with longitude on the x-axis, latitude on the y-axis, and the points colored by the predictions. I suggest a point size of 0.2 and the color scale scale_color_distiller(palette = "Spectral").

tract %>%
  mutate(pred = predict(model, newx = X)) %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = pred), size = 0.2) +
    scale_color_distiller(palette = "Spectral")
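
For comparison, you can swap the predictions for the observed values to see how much spatial detail the grid model smooths away (the same aesthetics, just a different color variable):

tract %>%
  ggplot(aes(lon, lat)) +
    geom_point(aes(color = median_income), size = 0.2) +
    scale_color_distiller(palette = "Spectral")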