Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.
Today we are going to look at a fairly large dataset. Each row of the data refers to a single reported crime in the City of Chicago.
The available variables are:
area_number: the community area code of the crime; a number from 1 to 77
arrest_flag: whether the crime resulted in an arrest; 0 is false and 1 is true
domestic_flag: whether the crime is classified as a domestic offense; 0 is false and 1 is true
night_flag: did the crime occur at night (9pm - 3am); 0 is false and 1 is true
burglary: was the crime classified as a burglary? 0 is false and 1 is true
theft: was the crime classified as a theft? 0 is false and 1 is true
battery: was the crime classified as a battery? 0 is false and 1 is true
damage: was the crime classified as criminal damage? 0 is false and 1 is true
assault: was the crime classified as an assault? 0 is false and 1 is true
deception: was the crime classified as criminal deception? 0 is false and 1 is true
robbery: was the crime classified as a robbery? 0 is false and 1 is true
narcotics: was the crime classified as a narcotics violation? 0 is false and 1 is true
We also have metadata about each community area within Chicago (the dataset ca). We will see how to use it shortly.
area_number: the community area code; a number from 1 to 77
area_name: popular name of the community area
median_age: the median age of all residents in the community area
num_households: total number of households
family_households: percentage of households classified as a 'family' (domestic partners, married couples, and one or more parents with children)
family_w_kids: percentage of households with children under the age of 18
owner_ratio: ratio of households that own or mortgage their primary residence
mean_travel_time: average commute time
percent_walk: percentage of commuters who walk to work (0-100)
median_income: median household income
perc_20_units: percentage of residential buildings with 20 or more units
It is difficult to do much of anything directly with the raw data. We need to use the group_summarize function to get somewhere interesting. Before doing that on the whole dataset, let's make sure that we understand exactly what is going on by using the mean() and sum() functions directly. Take the mean of arrest_flag for the whole dataset:
## # A tibble: 1 x 1
## arrest_flag_mean
## <dbl>
## 1 0.172
Describe what this means in words.
Answer: 17.2 percent of the crimes in the dataset resulted in an arrest.
Now, take the mean of the theft variable over the entire dataset:
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 0.275
Describe what this means in words.
Answer: 27.5 percent of the crimes in the dataset are thefts.
Take the sum of the theft variable:
## # A tibble: 1 x 1
## theft_sum
## <dbl>
## 1 38816
Describe what this means in words.
Answer: There are 38816 thefts in the dataset.
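The two summaries used so far behave in a simple way on 0/1 flags: the sum counts the rows where the flag is 1, and the mean is that count divided by the number of rows. A minimal sketch on a made-up vector (not the real crimes data) illustrates this:

```r
# Toy data, not the real crimes dataset: six crimes with a 0/1 theft flag.
theft <- c(1, 0, 0, 1, 0, 1)

# The sum of a 0/1 flag counts the rows where the flag is 1.
theft_count <- sum(theft)

# The mean is that count divided by the number of rows, i.e. a proportion.
theft_share <- mean(theft)

print(theft_count)
print(theft_share)
```

Here three of the six toy rows are thefts, so the sum is 3 and the mean is 0.5.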
Take the dataset ca and calculate the sum of the variable num_households:
## # A tibble: 1 x 1
## num_households_sum
## <dbl>
## 1 1061928
Divide the sum of the theft variable by the sum of the number of households variable and multiply by 1000 (Note: you may need to do this manually by copying and pasting the two values).
## [1] 36.55238
Describe what this means in words.
Answer: There are about 36.6 thefts per 1000 households in this dataset.
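The arithmetic for this rate can be checked by typing in the two sums reported above. This is only a sketch of the manual copy-and-paste approach; the variable names are chosen here for illustration:

```r
# Values copied from the output shown earlier in this worksheet.
total_thefts <- 38816
total_households <- 1061928

# Thefts per 1000 households over the whole dataset.
thefts_per_1000 <- total_thefts / total_households * 1000
print(thefts_per_1000)
```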
Use the filter function to construct a dataset temp consisting only of those rows in crimes that come from area_number 23. This is the area named Humboldt Park.
Take the mean of the variable arrest_flag on the data temp.
## # A tibble: 1 x 1
## arrest_flag_mean
## <dbl>
## 1 0.257
Is this smaller, larger, or about the same as the mean of the arrest flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.
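Answer: It is larger: 25.7 percent in Humboldt Park versus 17.2 percent over the entire dataset. Yes, we can compare these because both are proportions of crimes that resulted in an arrest.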
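The filter-then-summarize pattern can be sketched on a small made-up table. The example assumes the dplyr package is installed; toy_crimes and its values are hypothetical stand-ins for crimes, and base mean() is used in place of the course's summary helper:

```r
library(dplyr)

# Toy stand-in for `crimes`; the values are made up for illustration.
toy_crimes <- data.frame(
  area_number = c(23, 23, 23, 23, 8, 8, 8, 8),
  arrest_flag = c(1, 0, 0, 1, 0, 0, 1, 0)
)

# Keep only the rows from community area 23.
temp <- filter(toy_crimes, area_number == 23)

# Proportion of crimes ending in an arrest, in the area and overall.
area_rate <- mean(temp$arrest_flag)
overall_rate <- mean(toy_crimes$arrest_flag)

print(area_rate)
print(overall_rate)
```

Because both numbers are proportions of rows, they stay comparable even though temp has fewer rows than the full table.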
Take the mean of the variable theft on the data temp.
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 0.188
Is this smaller, larger, or about the same as the mean of the theft flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.
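Answer: It is smaller: 18.8 percent of the crimes in Humboldt Park are thefts, versus 27.5 percent over the entire dataset. Yes, both are proportions of crimes, so they can be compared.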
Manually look up the number of households in area 23, Humboldt Park, by looking at the dataset ca in the data viewer. Take the sum of the number of thefts in temp, divide this by the number of households in Humboldt Park, and multiply by 1000:
## # A tibble: 1 x 1
## num_households
## <dbl>
## 1 17830
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 44.1
Is this smaller, larger, or about the same as the same measurement over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.
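Answer: Larger: about 44.1 thefts per 1000 households in Humboldt Park versus about 36.6 over the entire dataset. Yes, both are rates per 1000 households and therefore can be compared.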
Now that we have an idea of what it means to take means and sums over this dataset, use group_summarize to summarize crimes at the community area level. Save the result as crimes_ca:
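To make the idea of a per-area summary concrete, here is a rough sketch using dplyr's group_by and summarise on made-up data. This is not the course's group_summarize function; it only assumes that the helper produces one row per group with columns named in the style seen in the output above (such as arrest_flag_mean and theft_sum). The table toy_crimes is hypothetical.

```r
library(dplyr)

# Toy stand-in for `crimes`; the values are made up for illustration.
toy_crimes <- data.frame(
  area_number = c(1, 1, 1, 2, 2),
  arrest_flag = c(1, 0, 0, 1, 1),
  theft       = c(1, 1, 0, 0, 1)
)

# One row per community area, with means and sums of the 0/1 flags.
toy_crimes_ca <- toy_crimes %>%
  group_by(area_number) %>%
  summarise(
    arrest_flag_mean = mean(arrest_flag),
    theft_mean       = mean(theft),
    theft_sum        = sum(theft),
    n                = n()
  )

print(toy_crimes_ca)
```

Each row of the result describes one community area rather than one crime, which is the shape needed to combine it with area-level metadata.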
We want to combine the datasets crimes_ca and ca. To do this we use a new function left_join, as follows:
What variable is R using to match these datasets up?
Answer: area_number
By default, R will use any commonly named variables to match the datasets up. If we need it, I will show you how to modify this behavior later.
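The matching behavior can be seen on two small made-up tables. This sketch assumes the dplyr package; toy_crimes_ca and toy_ca are hypothetical stand-ins for crimes_ca and ca. With no extra arguments, left_join prints a message naming the shared column it matched on; naming that column explicitly gives the same result here.

```r
library(dplyr)

# Toy stand-ins for `crimes_ca` and `ca`; values are made up.
toy_crimes_ca <- data.frame(
  area_number = c(1, 2, 3),
  theft_sum   = c(20, 5, 12)
)
toy_ca <- data.frame(
  area_number    = c(1, 2, 3),
  area_name      = c("A", "B", "C"),
  num_households = c(1000, 500, 800)
)

# Default: match on every column name the two tables share.
joined_default <- left_join(toy_crimes_ca, toy_ca)

# Same join with the matching column spelled out.
joined_explicit <- left_join(toy_crimes_ca, toy_ca, by = "area_number")

print(joined_default)
```

The joined table keeps every row of the first table and adds the metadata columns from the second.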
Construct a variable theft_rate equal to the number of thefts in each community area, divided by the number of households and multiplied by 1000.
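One way to build such a column is dplyr's mutate, sketched below on a made-up joined table. The column name theft_sum is an assumption about how the summarized theft count is named; check the actual names in your joined data before adapting this.

```r
library(dplyr)

# Toy joined table; values and the name `theft_sum` are illustrative.
toy_joined <- data.frame(
  area_number    = c(1, 2),
  theft_sum      = c(20, 5),
  num_households = c(1000, 500)
)

# Thefts per 1000 households in each community area.
toy_joined <- mutate(
  toy_joined,
  theft_rate = theft_sum / num_households * 1000
)

print(toy_joined)
```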
Draw a scatter plot with median income on the x-axis and theft rate on the y-axis.
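A scatter plot of this kind can be sketched with ggplot2, assuming that package is available; the course may use a different plotting helper. The data below are made up and say nothing about the real relationship between income and theft rate.

```r
library(ggplot2)

# Toy area-level data; values are made up for illustration.
toy_areas <- data.frame(
  median_income = c(30000, 45000, 60000, 90000),
  theft_rate    = c(40, 35, 50, 70)
)

# Median income on the x-axis, theft rate on the y-axis.
p <- ggplot(toy_areas, aes(x = median_income, y = theft_rate)) +
  geom_point()

print(p)
```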