Instructions

Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.

Chicago Crime Data

Today we are going to look at a fairly large dataset. Each row of the data refers to a single reported crime in the City of Chicago.

The available variables are:

We also have metadata about each community area within Chicago. We will see how to use it shortly.

It is difficult to do much of anything directly with the raw data. We need to utilize the group_summarize function to get somewhere interesting. Before doing that on the whole dataset, let’s make sure that we understand exactly what is going on by using the mean() and sum() functions directly. Take the mean of arrest_flag for the whole dataset:

## # A tibble: 1 x 1
##   arrest_flag_mean
##              <dbl>
## 1            0.172

Describe what this means in words.

Answer: 17.2 percent of the reported crimes in the dataset resulted in an arrest.

Now, take the mean of the theft variable over the entire dataset:

## # A tibble: 1 x 1
##   theft_mean
##        <dbl>
## 1      0.275

Describe what this means in words.

Answer: 27.5 percent of the reported crimes in the dataset are thefts.

Take the sum of the theft variable:

## # A tibble: 1 x 1
##   theft_sum
##       <dbl>
## 1     38816

Describe what this means in words.

Answer: There are 38,816 thefts in the dataset.
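The link between the mean and the sum of a 0/1 indicator such as arrest_flag or theft can be checked on a tiny example. The sketch below uses an invented ten-row vector, not the Chicago data: the sum counts the rows flagged 1, and the mean is that count divided by the number of rows.

```r
# Hypothetical toy indicator: 1 = theft, 0 = any other crime (invented values)
theft_toy <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

theft_toy_sum  <- sum(theft_toy)   # number of rows flagged as theft
theft_toy_mean <- mean(theft_toy)  # proportion of rows flagged as theft

# The mean is the sum divided by the number of rows
print(theft_toy_sum)
print(theft_toy_mean)
print(theft_toy_sum / length(theft_toy))
```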

Take the dataset ca and calculate the sum of the variable num_households:

## # A tibble: 1 x 1
##   num_households_sum
##                <dbl>
## 1            1061928

Divide the sum of the theft variable by the sum of the number of households variable and multiply by 1000 (Note: you may need to do this manually by copying and pasting).

## [1] 36.55238

Describe what this means in words.

Answer: There are about 36.6 thefts per 1,000 households in this dataset.
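As a check on the arithmetic, the rate can be recomputed from the two sums printed earlier. The variable names below are chosen for illustration only; the two numbers are copied from the outputs above.

```r
# Values copied from the outputs above
theft_sum          <- 38816    # sum of theft over crimes
num_households_sum <- 1061928  # sum of num_households over ca

# Thefts per 1000 households
theft_rate_per_1000 <- theft_sum / num_households_sum * 1000
print(theft_rate_per_1000)
```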

Use the filter function to construct a dataset temp consisting only of those rows in crimes that come from area_number 23. This is the area named Humboldt Park.

Take the mean of the variable arrest_flag on the data temp.

## # A tibble: 1 x 1
##   arrest_flag_mean
##              <dbl>
## 1            0.257

Is this smaller, larger, or about the same as the mean of the arrest flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

Answer: It is larger (25.7 percent versus 17.2 percent). Yes, we can compare these because both are rates: the proportion of reported crimes that resulted in an arrest.
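The filter-then-summarize pattern used here can be sketched on toy data. This assumes the dplyr package; the table crimes_toy and its values are invented stand-ins for crimes, and only the column names follow the worksheet.

```r
library(dplyr)

# Hypothetical toy stand-in for the crimes table (invented values)
crimes_toy <- tibble(
  area_number = c(23, 23, 23, 23, 8, 8, 8, 8, 8, 8),
  arrest_flag = c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  theft       = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 0)
)

# Keep only the rows from area 23
temp_toy <- filter(crimes_toy, area_number == 23)

# Arrest rate within area 23 and over the whole toy table
area_mean    <- summarize(temp_toy, arrest_flag_mean = mean(arrest_flag))
overall_mean <- summarize(crimes_toy, arrest_flag_mean = mean(arrest_flag))

print(area_mean)
print(overall_mean)
```

Both results are proportions of rows, so they stay comparable even though the two tables have different numbers of rows.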

Take the mean of the variable theft on the data temp.

## # A tibble: 1 x 1
##   theft_mean
##        <dbl>
## 1      0.188

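Is this smaller, larger, or about the same as the mean of the theft flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

Answer: It is smaller (18.8 percent versus 27.5 percent). Yes, both are proportions of reported crimes, so they can be compared: thefts make up a smaller share of reported crimes in Humboldt Park than in the city as a whole.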

Manually look up the number of households in area 23, Humboldt Park, by looking at the dataset ca in the data viewer. Take the sum of the number of thefts in temp, divide this by the number of households in Humboldt Park, and multiply by 1000:

## # A tibble: 1 x 1
##   num_households
##            <dbl>
## 1          17830
## # A tibble: 1 x 1
##   theft_mean
##        <dbl>
## 1       44.1

Is this smaller, larger, or about the same as the same measurement over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

Answer: Larger (44.1 versus about 36.6 thefts per 1,000 households). And yes, these are again rates and can therefore be compared.

Now that we have an idea of what it means to take means and sums over this dataset, use group_summarize to summarize crimes at the community area level. Save the result as crimes_ca:
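The interface of group_summarize is not shown in this worksheet, so the following is only a sketch of the same idea using dplyr's group_by and summarize as a stand-in. The toy table and the output column n are assumptions; the other output names mimic the ones printed earlier (arrest_flag_mean, theft_mean, theft_sum).

```r
library(dplyr)

# Hypothetical toy stand-in for the crimes table (invented values)
crimes_toy <- tibble(
  area_number = c(23, 23, 23, 23, 8, 8, 8, 8, 8, 8),
  arrest_flag = c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  theft       = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 0)
)

# One row per community area, with means and sums computed within each area
crimes_ca_toy <- crimes_toy %>%
  group_by(area_number) %>%
  summarize(
    arrest_flag_mean = mean(arrest_flag),
    theft_mean       = mean(theft),
    theft_sum        = sum(theft),
    n                = n()   # number of reported crimes in the area
  )

print(crimes_ca_toy)
```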

We want to combine the datasets crimes_ca and ca. To do this we use a new function left_join, as follows:

What variable is R using to match these datasets up?

Answer: area_number

By default, R will use any commonly named variables to match the datasets up. If we need it, I will show you how to modify this behavior later.
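A minimal sketch of this matching behavior, assuming dplyr's left_join and two invented toy tables: the only column the tables share is area_number, so the default join matches on it, and naming the key with by gives the same result.

```r
library(dplyr)

# Hypothetical toy tables (invented values); area_number is the only shared column
crimes_ca_toy <- tibble(area_number = c(8, 23), theft_sum = c(3, 1))
ca_toy <- tibble(
  area_number    = c(8, 23, 77),
  num_households = c(1000, 500, 800)
)

# Default: join on all commonly named columns (dplyr reports which ones it used)
joined_default <- left_join(crimes_ca_toy, ca_toy)

# Explicit: name the key column yourself
joined_explicit <- left_join(crimes_ca_toy, ca_toy, by = "area_number")

print(joined_explicit)
```

A left join keeps every row of the first table, so area 77, which appears only in ca_toy, is not in the result.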

Construct a variable theft_rate equal to the number of thefts in each community area, divided by the number of households and multiplied by 1000.
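One way to build such a variable is dplyr's mutate. The sketch below uses an invented joined table; the column names theft_sum and num_households are assumed to match the joined data.

```r
library(dplyr)

# Hypothetical toy stand-in for the joined table (invented values)
joined_toy <- tibble(
  area_number    = c(8, 23),
  theft_sum      = c(3, 1),
  num_households = c(1000, 500)
)

# Thefts per 1000 households in each community area
joined_toy <- mutate(joined_toy, theft_rate = theft_sum / num_households * 1000)

print(joined_toy)
```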

Draw a scatter plot with median income on the x-axis and theft rate on the y-axis.
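A scatter plot of this kind can be sketched with ggplot2. The data frame below is invented, and the column name median_income is a hypothetical placeholder, since the worksheet does not show what the income variable in ca is called.

```r
library(ggplot2)

# Hypothetical toy data (invented values); median_income is an assumed column name
plot_toy <- data.frame(
  median_income = c(30000, 45000, 60000, 90000),
  theft_rate    = c(40, 35, 50, 30)
)

# Median income on the x-axis, theft rate on the y-axis, one point per area
p <- ggplot(plot_toy, aes(x = median_income, y = theft_rate)) +
  geom_point()

print(p)
```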