This is the first “lab” of the semester. I have included several empty code chunks and text prompts that start with “Answer:”. Fill in a solution for each: code for the chunks and a short answer for the prompts. Labs are not handed in or graded, and solutions will be posted for the next class. I suggest working as a group with a single shared screen, but you may also want to reproduce the code on your own machine.

California House Prices

In this lab you will apply the methods shown in the notes to a data set of housing prices from California. Our variable of interest is the median house sale price in each census tract.

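library(tidyverse)  # provides read_csv(), mutate(), the pipe, and ggplot()

# Fix the random seed so the train/validation split is reproducible.
set.seed(1)

# Read the data and randomly assign about 60% of the rows to the training set.
ca <- read_csv(file.path("data", "ca_house_price.csv")) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))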
ca
## # A tibble: 5,984 x 29
##    median_house_va… population latitude longitude total_units vacant_units
##               <dbl>      <dbl>    <dbl>     <dbl>       <dbl>        <dbl>
##  1             910.       1974     37.8     -122.         929           37
##  2             749.       4865     37.8     -122.        2655          134
##  3             774.       3703     37.8     -122.        1911           68
##  4             481.       1571     37.8     -122.         781           65
##  5             439.       2302     37.8     -122.        1202           80
##  6             370.       5678     37.8     -122.        2665          500
##  7             467.       4156     37.8     -122.        2182          148
##  8             591.       2416     37.8     -122.        1372           33
##  9             352.       3528     37.8     -122.        1908          227
## 10             323.       4314     37.8     -122.        1870          432
## # … with 5,974 more rows, and 23 more variables: median_rooms <dbl>,
## #   mean_household_size_owners <dbl>, mean_household_size_renters <dbl>,
## #   built_2005_or_later <dbl>, built_2000_to_2004 <dbl>, built_1990s <dbl>,
## #   built_1980s <dbl>, built_1970s <dbl>, built_1960s <dbl>, built_1950s <dbl>,
## #   built_1940s <dbl>, built_1939_or_earlier <dbl>, bedrooms_0 <dbl>,
## #   bedrooms_1 <dbl>, bedrooms_2 <dbl>, bedrooms_3 <dbl>, bedrooms_4 <dbl>,
## #   bedrooms_5_or_more <dbl>, owners <dbl>, renters <dbl>,
## #   median_household_income <dbl>, mean_household_income <dbl>, train_id <chr>
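
The train_id column is created by drawing one uniform random number per row and labeling the row "train" when the draw falls below 0.6. As a rough check that this produces roughly a 60/40 split, the sketch below repeats the same assignment on a stand-in table. The table toy and its id column are hypothetical; only the row count (5,984) is taken from the output above.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for `ca`: only the number of rows matters here.
toy <- tibble(id = seq_len(5984)) %>%
  mutate(train_id = if_else(runif(n()) < 0.6, "train", "valid"))

# Count the rows in each split and convert the counts to proportions.
split_counts <- toy %>%
  count(train_id) %>%
  mutate(prop = n / sum(n))

split_counts
```

Running the same count() and mutate() steps on ca itself should give the proportions for the real data. Because the assignment is random, the training share will be close to 0.6 but not exactly equal to it.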

There are several numeric variables that you can use to predict the median house value; their names should be reasonably self-explanatory.

Exploring the Data

To start, draw a histogram with 15 bins showing the distribution of the median house value variable:

ca %>%
  ggplot(aes(median_house_value)) +
    geom_histogram(bins = 15, color = "black", fill = "white")

What are the most typical values of the variable (they are given in thousands of dollars)? Answer: There are many values around $300k, with a long tail; overall the values range from near $0 to just under $1 million.
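
A histogram shows the shape of the distribution; a few quantiles can put numbers on what "typical" means. The sketch below is one possible way to do this. The helper summarize_value() is a hypothetical name, and the stand-in table first_ten holds only the ten (rounded) values visible in the printout above, so its results do not describe the full data set.

```r
library(dplyr)

# Hypothetical helper: a five-number-style summary of median_house_value.
summarize_value <- function(df) {
  df %>%
    summarize(
      min    = min(median_house_value, na.rm = TRUE),
      q25    = quantile(median_house_value, 0.25, na.rm = TRUE),
      median = median(median_house_value, na.rm = TRUE),
      q75    = quantile(median_house_value, 0.75, na.rm = TRUE),
      max    = max(median_house_value, na.rm = TRUE)
    )
}

# Stand-in data: the first ten values as printed above (thousands of dollars).
first_ten <- tibble(
  median_house_value = c(910, 749, 774, 481, 439, 370, 467, 591, 352, 323)
)

value_summary <- summarize_value(first_ten)
value_summary
```

Calling summarize_value(ca) applies the same summary to the full table.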

Now, draw a scatter plot showing the relationship between the median household income and the median house value.

ca %>%
  ggplot(aes(median_household_income, median_house_value)) +
    geom_point()

How would you describe this relationship? Is it surprising or as you would expect? Answer: In general, areas with a higher median income have more expensive houses, but the relationship has a lot of noise.

Now show the relationship between the mean household income and the median house value.

ca %>%
  ggplot(aes(mean_household_income, median_house_value)) +
    geom_point()
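
To compare the two scatter plots numerically, one option is to compute the correlation of each income variable with the median house value. The sketch below assumes the column names shown in the printout; the helper income_correlations() is a hypothetical name, and the table toy contains simulated values in arbitrary units that exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical helper: correlation of each income measure with house value.
income_correlations <- function(df) {
  df %>%
    summarize(
      cor_median_income = cor(median_household_income, median_house_value,
                              use = "complete.obs"),
      cor_mean_income   = cor(mean_household_income, median_house_value,
                              use = "complete.obs")
    )
}

# Simulated stand-in data (NOT the real values from the lab's data set).
set.seed(1)
toy <- tibble(
  median_household_income = runif(200, 20, 150),
  mean_household_income   = median_household_income * 1.2 + rnorm(200, sd = 10),
  median_house_value      = 4 * median_household_income + rnorm(200, sd = 100)
)

cors <- income_correlations(toy)
cors
```

Calling income_correlations(ca) gives the values for the real data. Correlation only measures linear association, so it complements the plots rather than replacing them.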