Due: 2018-10-23 (start of class)
Draft: 2018-10-18 (three finished graphs; start of class)
Starter code: project-ii.Rmd
Data dictionary: acs-data-dictionary
The overarching goal of this project is to tell an interesting narrative about the demographics of a particular metropolitan area in the United States. The structure of the report is much more open-ended than that of the first project.
For this project we will all be working off of the same master dataset. The data gives demographic information about census tracts. You will each, however, be looking at different metropolitan areas in the United States.
Your task is to write a short essay in the style of a 538 news article. The essay should describe one or more interesting elements you discovered while investigating the metropolitan area that you have been assigned. Keep in mind that you will want to draw on one or more of the following tasks from exploratory data analysis:
- anomaly detection: identify areas that seem to behave differently than the rest of the data
- perspective: pick a particular area of interest and compare it to the rest of the data
- pattern recognition: understand the basic patterns present in your dataset
The final report should contain exactly three visualizations. This means that you should take care to make each visualization as information dense as possible. Aim for a final report of around 250-500 words. The word count is not a hard limit; it is just a guideline to indicate the expected length of the report. All of the plots should be integrated into the essay in a meaningful way rather than all included at the start or end of the essay.
The grade for the assignment depends primarily on the effectiveness of the graphics in conveying information, the quality of the writing, and how well the writing and visualizations are integrated.
You may find it very helpful to make maps of the data from your metropolitan area. These are great for exploratory work, but don’t overuse them in the report. To make a nice map, use the ggmap package:
library(ggmap)
acs_rva <- filter(acs, cbsa == "Richmond, VA")
qmplot(lon, lat, data = acs_rva, geom = "blank")
The qmplot function replaces the typical ggplot() function in the first line of a graphic. You can add other layers just as before. The code here adds points to the plot (notice that I set the alpha parameter to make sure the points do not cover up the rest of the plot).
qmplot(lon, lat, data = acs_rva, geom = "blank") +
  geom_point(aes(color = median_rent), alpha = 0.8) +
  scale_color_viridis()
You may also find it useful to construct new variables that aggregate the granular ones I have provided. For example, if you want to find what percentage of people have a commute of over 45 minutes you can do this:
acs$ctime_45_plus <- acs$ctime_45_59 + acs$ctime_60_89 + acs$ctime_90_99
The name of the new variable (here, ctime_45_plus) is entirely up to you.
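To see what this aggregation does, here is a minimal sketch on a small made-up data frame. The name acs_toy and all of its values are assumptions for illustration only; the column names are the ones used above.

```r
# Toy stand-in for the course dataset; the values below are made up.
acs_toy <- data.frame(
  ctime_45_59 = c(5.0, 10.0, 2.5),
  ctime_60_89 = c(3.0, 4.0, 1.5),
  ctime_90_99 = c(1.0, 2.0, 0.5)
)

# Same aggregation as above, applied to the toy data.
acs_toy$ctime_45_plus <- acs_toy$ctime_45_59 +
  acs_toy$ctime_60_89 +
  acs_toy$ctime_90_99

# Quick sanity check: a percentage should stay between 0 and 100.
range(acs_toy$ctime_45_plus)
```

Running a similar range check on your real aggregated variable is a cheap way to catch a mistyped column name.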
Also, some students have wanted to create a variable that shows, for each tract, the maximum category from a group of variables. You can do this with the following code (replace acs with the name of your dataset) for the race variables:
temp <- select(acs, starts_with("race_"))
acs$max_race_category <- names(temp)[apply(temp, 1, which.max)]
It should be clear how to modify this for other variables (but if not, please ask!).
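As one possible adaptation, here is a minimal sketch of the same trick for the commute-time variables. It runs on a small made-up data frame: the name acs_toy and its values are assumptions for illustration, and only the three column names come from the commute example earlier.

```r
library(dplyr)

# Toy stand-in for the course dataset; the values below are made up.
acs_toy <- data.frame(
  ctime_45_59 = c(10, 3, 1),
  ctime_60_89 = c(5, 8, 2),
  ctime_90_99 = c(2, 1, 6)
)

# Pick out the group of columns by their shared prefix.
temp <- select(acs_toy, starts_with("ctime_"))

# For each row, find the position of the largest value and look up its name.
acs_toy$max_ctime_category <- names(temp)[apply(temp, 1, which.max)]

acs_toy$max_ctime_category
```

Two things to watch for: starts_with() also picks up any aggregate you created with the same prefix (such as ctime_45_plus), so build temp before adding those or leave them out; and which.max returns the first column when there is a tie.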
Many of you, myself included, ran into errors with the ggmap package. If you want to make maps, I suggest running this command in RStudio (just once):
Then, remove this line and restart R. You should be able to make plots as given in the course notes.
Here are some hints if you are still stuck on telling a story:
- Try looking at the household income variables. These are in dollars, not percentages, and often have stronger, more obvious relationships to the other variables.
- Try to create a max category for one of the clusters of variables. Try plotting using categories as colors on a map. If almost all of the points have the same category, try instead to use the percentage of this max category as a measurement. For example, if almost all tracts have race_white as the largest category, then try to look at the race_white variable. If there is a nice mix of categories, try using the max category variable directly.
- If you have a variable of interest, try creating a confidence interval (geom_confint) plot by counties. You could also try using the max_categories if they are interesting.
- Generally, you don’t want to use multiple variables from the same section (not including the income variables). Either collapse categories, use the max_categories trick, or pick just one that interests you.
- You do not need to have three different kinds of plots, but thinking about three types can make the project easier. A map, a scatter plot, and a confidence interval plot can often be used together to great effect.
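For the max category hint above, here is a minimal sketch of how you might check whether one category dominates and, if so, fall back to the percentage held by the largest category. It uses a made-up data frame: acs_toy, its values, the column names race_black and race_asian, and the new name max_race_pct are all assumptions for illustration. Check the data dictionary for the real variable names.

```r
library(dplyr)

# Toy stand-in for the course dataset; the values below are made up.
acs_toy <- data.frame(
  race_white = c(80, 65, 90, 40),
  race_black = c(15, 30, 5, 50),  # hypothetical column name
  race_asian = c(5, 5, 5, 10)     # hypothetical column name
)

temp <- select(acs_toy, starts_with("race_"))
acs_toy$max_race_category <- names(temp)[apply(temp, 1, which.max)]

# Count how many tracts fall in each max category. If one category
# accounts for nearly every tract, coloring by it will not show much.
table(acs_toy$max_race_category)

# Alternative measurement: the percentage held by each tract's
# largest category (max_race_pct is a name I made up).
acs_toy$max_race_pct <- apply(temp, 1, max)
acs_toy$max_race_pct
```

A numeric variable like max_race_pct can then be mapped to color in the same way as median_rent in the map example.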