Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.

Today we are going to look at a fairly large dataset. Each row of the data refers to a single reported crime in the City of Chicago.

The available variables are:

`area_number`
: the community area code of the crime; a number from 1 to 77

`arrest_flag`
: whether the crime resulted in an arrest; 0 is false and 1 is true

`domestic_flag`
: whether the crime is classified as a domestic offense; 0 is false and 1 is true

`night_flag`
: whether the crime occurred at night (9pm - 3am); 0 is false and 1 is true

`burglary`
: whether the crime was classified as a burglary; 0 is false and 1 is true

`theft`
: whether the crime was classified as a theft; 0 is false and 1 is true

`battery`
: whether the crime was classified as a battery; 0 is false and 1 is true

`damage`
: whether the crime was classified as damage; 0 is false and 1 is true

`assault`
: whether the crime was classified as an assault; 0 is false and 1 is true

`deception`
: whether the crime was classified as criminal deception; 0 is false and 1 is true

`robbery`
: whether the crime was classified as a robbery; 0 is false and 1 is true

`narcotics`
: whether the crime was classified as a narcotics violation; 0 is false and 1 is true

We also have metadata about each community area within Chicago, stored in the dataset `ca`. We will see how to use it shortly.

`area_number`
: the community area code; a number from 1 to 77

`area_name`
: popular name of the community area

`median_age`
: the median age of all residents in the community area

`num_households`
: total number of households

`family_households`
: percentage of households classified as a "family" (domestic partners, married couples, and one or more parents with children)

`family_w_kids`
: percentage of households with children under the age of 18

`owner_ratio`
: ratio of households that own or mortgage their primary residence

`mean_travel_time`
: average commute time

`percent_walk`
: percentage of commuters who walk to work (0-100)

`median_income`
: median household income

`perc_20_units`
: percentage of residential buildings with 20 or more units

It is difficult to do much of anything directly with the raw data. We need to use the `group_summarize` function to get somewhere interesting. Before doing that on the whole dataset, let's make sure that we understand exactly what is going on by using the `mean()` and `sum()` functions directly. Take the mean of `arrest_flag` for the whole dataset:

```
## # A tibble: 1 x 1
## arrest_flag_mean
## <dbl>
## 1 0.172
```

Describe what this means in words.

**Answer**: 17.2 percent of the crimes in the dataset resulted in an arrest.

Now, take the mean of the `theft` variable over the entire dataset:

```
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 0.275
```

Describe what this means in words.

**Answer**: 27.5 percent of the crimes in the dataset are thefts.

Take the sum of the `theft` variable:

```
## # A tibble: 1 x 1
## theft_sum
## <dbl>
## 1 38816
```

Describe what this means in words.

**Answer**: There are 38,816 thefts in the dataset.
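To see why the mean of a 0/1 flag reads as a proportion and its sum reads as a count, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration; they are not taken from `crimes`.

```r
# Toy data (made up): six crimes, two of which were thefts.
toy <- data.frame(theft = c(1, 0, 0, 1, 0, 0))

# The mean of a 0/1 flag is the fraction of rows equal to 1.
theft_share <- mean(toy$theft)

# The sum of a 0/1 flag is the number of rows equal to 1.
theft_count <- sum(toy$theft)

theft_share  # 2 out of 6 rows
theft_count  # 2 rows
```

The same two calls, applied to a column of `crimes`, would give the kind of values shown in the outputs above.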

Take the dataset `ca` and calculate the sum of the variable `num_households`:

```
## # A tibble: 1 x 1
## num_households_sum
## <dbl>
## 1 1061928
```

Divide the sum of the `theft` variable by the sum of the `num_households` variable and multiply by 1000 (Note: you may need to do this manually by copying and pasting the two totals).

`## [1] 36.55238`

Describe what this means in words.

**Answer**: There are about 36.6 thefts per 1,000 households in this dataset.
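The manual calculation can be sketched as follows, using the two totals printed in the outputs above. The variable names are chosen here for illustration only.

```r
# Totals copied from the outputs above.
total_thefts     <- 38816
total_households <- 1061928

# Thefts per 1,000 households across the whole dataset.
thefts_per_1000 <- total_thefts / total_households * 1000
round(thefts_per_1000, 2)  # 36.55
```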

Use the `filter` function to construct a dataset `temp` consisting only of those rows in `crimes` that come from `area_number` 23. This is the area named Humboldt Park.

Take the mean of the variable `arrest_flag` on the data `temp`.

```
## # A tibble: 1 x 1
## arrest_flag_mean
## <dbl>
## 1 0.257
```

Is this smaller, larger, or about the same as the mean of the arrest flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

**Answer**: It's larger (0.257 versus 0.172). Yes, we can compare these because they are both rates.
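A minimal sketch of the filter-then-summarize step, assuming the `filter` function in use is the one from dplyr. The tibble `toy_crimes` is a made-up stand-in for `crimes`.

```r
library(dplyr)

# Toy stand-in for `crimes` (values are made up).
toy_crimes <- tibble(
  area_number = c(23, 23, 23, 23, 8, 8),
  arrest_flag = c(1, 0, 0, 1, 0, 0)
)

# Keep only the rows from community area 23.
temp <- filter(toy_crimes, area_number == 23)

# Share of crimes in that area that led to an arrest.
temp_arrest_rate <- mean(temp$arrest_flag)

# Share over all rows, for comparison.
overall_arrest_rate <- mean(toy_crimes$arrest_flag)
```

Both values are proportions of crimes, so they can be compared directly even though the two groups contain different numbers of rows.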

Take the mean of the variable `theft` on the data `temp`.

```
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 0.188
```

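Is this smaller, larger, or about the same as the mean of the theft flag over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

**Answer**: It is smaller (0.188 versus 0.275): thefts make up a smaller share of the crimes in Humboldt Park than of the crimes in the dataset as a whole. Yes, we can compare these because both are proportions.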

Manually look up the number of households in area 23, Humboldt Park, by looking at the dataset `ca` in the data viewer. Take the sum of the number of thefts in `temp`, divide this by the number of households in Humboldt Park, and multiply by 1000:

```
## # A tibble: 1 x 1
## num_households
## <dbl>
## 1 17830
```

```
## # A tibble: 1 x 1
## theft_mean
## <dbl>
## 1 44.1
```

Is this smaller, larger, or about the same as the same measurement over the entire dataset? Can we safely compare these measurements? If so, describe the relationship in words.

**Answer**: Larger (44.1 versus about 36.6 thefts per 1,000 households). And yes, these are again rates and can therefore be compared.

Now that we have an idea of what it means to take means and sums over this dataset, use `group_summarize` to summarize `crimes` at the community area level. Save the result as `crimes_ca`:
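The idea behind a grouped summary can be sketched with dplyr's `group_by()` and `summarise()`. This is a stand-in for the course's `group_summarize` helper, not a description of how that helper is implemented. The tibble `toy_crimes` is made up, and the output column names only imitate the `_mean` and `_sum` style seen in the outputs above.

```r
library(dplyr)

# Toy stand-in for `crimes` (values are made up).
toy_crimes <- tibble(
  area_number = c(8, 8, 8, 23, 23),
  arrest_flag = c(0, 1, 0, 1, 1),
  theft       = c(1, 1, 0, 0, 1)
)

# One row per community area, with means and sums of the flags.
toy_crimes_ca <- toy_crimes %>%
  group_by(area_number) %>%
  summarise(
    arrest_flag_mean = mean(arrest_flag),
    theft_mean       = mean(theft),
    theft_sum        = sum(theft),
    n                = n()
  )

toy_crimes_ca
```

Each row of the result repeats, for one community area, the calculation done by hand for Humboldt Park.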

We want to combine the datasets `crimes_ca` and `ca`. To do this we use a new function, `left_join`, as follows:

What variable is R using to match these datasets up?

**Answer**: area_number

By default, R will use any commonly named variables to match the datasets up. If we need it, I will show you how to modify this behavior later.
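A small sketch of the matching behavior, assuming `left_join` here is the dplyr function. Both tibbles are made-up stand-ins for `crimes_ca` and `ca`, and the household counts are invented.

```r
library(dplyr)

# Toy stand-ins for `crimes_ca` and `ca` (values are made up).
toy_crimes_ca <- tibble(
  area_number = c(8, 23),
  theft_sum   = c(2, 1)
)
toy_ca <- tibble(
  area_number    = c(8, 23, 32),
  num_households = c(500, 400, 300)
)

# With no `by` argument, dplyr matches on every column name the two
# tables share; here that is only `area_number`.
joined <- left_join(toy_crimes_ca, toy_ca)

# Naming the key explicitly gives the same result.
joined_explicit <- left_join(toy_crimes_ca, toy_ca, by = "area_number")

joined
```

A left join keeps every row of the first table, so area 32, which appears only in `toy_ca`, is absent from the result.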

Construct a variable `theft_rate` equal to the number of thefts in each community area, divided by the number of households and multiplied by 1000.
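One way to build such a variable is with dplyr's `mutate()`, sketched here on made-up data. The column name `theft_sum` is an assumption about what the summarized dataset calls the theft count.

```r
library(dplyr)

# Toy stand-in for the joined dataset (values are made up).
toy_joined <- tibble(
  area_number    = c(8, 23),
  theft_sum      = c(2, 1),
  num_households = c(500, 400)
)

# Thefts per 1,000 households in each community area.
toy_joined <- mutate(
  toy_joined,
  theft_rate = theft_sum / num_households * 1000
)

toy_joined$theft_rate  # 4.0 and 2.5
```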

Draw a scatter plot with median income on the x-axis and theft rate on the y-axis.
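A scatter plot of this kind could be drawn with ggplot2, as in the sketch below. The course may use a different plotting interface, and the data frame here is made up; only the column names `median_income` and `theft_rate` come from the text above.

```r
library(ggplot2)

# Toy stand-in for the joined dataset (values are made up).
toy_joined <- data.frame(
  median_income = c(30000, 45000, 60000, 90000),
  theft_rate    = c(40, 35, 50, 70)
)

# Median income on the x-axis, theft rate on the y-axis.
p <- ggplot(toy_joined, aes(x = median_income, y = theft_rate)) +
  geom_point() +
  labs(
    x = "Median household income",
    y = "Thefts per 1,000 households"
  )

p
```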