Over the next two classes we are going to be looking at survey response data from the Sixteen Personality Factor Questionnaire in order to practice our skills at statistical inference:

`pf <- read_csv("https://statsmaths.github.io/stat_data/cattell_16pf.csv")`

```
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   age = col_integer(),
##   gender = col_character(),
##   country = col_character(),
##   elapsed = col_integer()
## )
## See spec(...) for full column specifications.
```

The dataset uses the following fields:

- age: respondent’s age in years
- gender: respondent’s self-selected gender
- country: two-letter country code (e.g., “HU”) inferred from the respondent’s IP address
- elapsed: time taken to complete quiz in seconds
- warmth: personality score from 1-20
- reasoning: personality score from 1-20
- emotional_stability: personality score from 1-20
- dominance: personality score from 1-20
- liveliness: personality score from 1-20
- rule_consciousness: personality score from 1-20
- social_boldness: personality score from 1-20
- sensitivity: personality score from 1-20
- vigilance: personality score from 1-20
- abstractedness: personality score from 1-20
- privateness: personality score from 1-20
- apprehension: personality score from 1-20
- openness_to_change: personality score from 1-20
- self_reliance: personality score from 1-20
- perfectionism: personality score from 1-20
- tension: personality score from 1-20
- baseline: average score across all 16 personality traits

*To start with, select a particular trait that you will use for the first bank of questions. I suggest picking something that popped out when you took the test. You can pick any of the traits other than sensitivity.*

Produce a bar plot of the personality scores for your trait.

```
ggplot(pf, aes(warmth)) +
  geom_bar()
```

Describe the distribution. Why does a bar plot work here even though the variable is numeric?

**Answer**: Most values fall between 12 and 16, with 15 the most common score. A bar plot works even though the variable is numeric because the scores only take whole-number values from 1 to 20, so there are just a handful of distinct values to count.
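As a quick check (using warmth as the example trait here), tabulating the values shows that the scores only take whole numbers between 1 and 20, which is why giving each value its own bar is sensible:

```
# warmth only takes integer values from 1 to 20,
# so a bar for each distinct value is a natural summary
table(pf$warmth)
```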

Produce a confidence interval for the mean of your trait.

```
model <- lm_basic(warmth ~ 1, data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1, data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0639 -1.0639 -0.0639 0.9361 5.9361
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 14.06 14.05 14.08
##
## Residual standard error: 2.023 on 49059 degrees of freedom
```

Now, produce a dataset that consists only of responses from the country of Hungary. The 2-letter country code for Hungary is “HU”. Produce a confidence interval for your trait on the Hungarian dataset.

```
pf_hu <- filter(pf, country == "HU")
model <- lm_basic(warmth ~ 1, data = pf_hu)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1, data = pf_hu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1273 -1.1273 -0.1273 1.8727 3.8727
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 13.13 12.55 13.71
##
## Residual standard error: 2.152 on 54 degrees of freedom
```

What do you notice about this confidence interval compared to the original one? Can you explain why this is the case?

**Answer**: It is much wider. The Hungarian subset contains only 55 responses compared with roughly 49,000 in the full dataset, so the standard error of the mean, and therefore the width of the interval, is much larger.
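To see why, compare the sample sizes: the width of a confidence interval for a mean shrinks roughly like one over the square root of the number of observations. A rough check, again using warmth as the example trait:

```
# the Hungarian subset is tiny compared with the full dataset,
# so the standard error of its mean (and hence the interval) is much wider
nrow(pf)
nrow(pf_hu)
sd(pf$warmth, na.rm = TRUE) / sqrt(nrow(pf))        # standard error, full data
sd(pf_hu$warmth, na.rm = TRUE) / sqrt(nrow(pf_hu))  # standard error, Hungarian data
```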

Find a 95% confidence interval for the difference between the average male and female value for your personality trait on the Hungarian data.

```
model <- lm_basic(warmth ~ 1 + gender, data = pf_hu)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1 + gender, data = pf_hu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.000 -1.219 0.000 1.781 3.781
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 13.2188 12.4495 13.988
## gendermale -0.2188 -1.4083 0.971
##
## Residual standard error: 2.17 on 53 degrees of freedom
## Multiple R-squared: 0.00256, Adjusted R-squared: -0.01626
## F-statistic: 0.136 on 1 and 53 DF, p-value: 0.7137
```

**Answer**: On the Hungarian data, the difference (male minus female) lies between -1.4083 and 0.971.

Is there a statistically significant difference between men and women’s average score? If so, what direction is this difference? Does this challenge or confirm traditional gender stereotypes (note: not all personality traits have one)?

**Answer**: No. The confidence interval for the gender difference contains zero, so there is no statistically significant difference between men’s and women’s average scores here.
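A direct cross-check (again with warmth) is to look at the group means themselves; the gap between men and women in the Hungarian data is small relative to the width of the interval above:

```
# mean warmth by gender in the Hungarian subset
pf_hu %>%
  group_by(gender) %>%
  summarize(avg_warmth = mean(warmth), n = n())
```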

Construct a new dataset that only has ages from 30-49 (from the original `pf`, not just the Hungarian subset). Hint: you can use the filter function twice.

`pf_ages <- filter(pf, age >= 30, age <= 49)`

On the dataset of people ages 30 to 49, create a variable called `fourties` that is `TRUE` when age is greater than or equal to 40 and `FALSE` otherwise.

`pf_ages$fourties <- pf_ages$age >= 40`
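A quick sanity check confirms the new variable splits the data the way we expect:

```
# fourties should be TRUE for ages 40-49 and FALSE for ages 30-39
table(pf_ages$fourties)
range(pf_ages$age)
```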

Fit a regression with your personality trait as the response to test for the difference between its mean for people in their 30s versus people in their 40s. Compute a confidence interval for this difference.

```
model <- lm_basic(warmth ~ fourties, data = pf_ages)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ fourties, data = pf_ages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1476 -1.0494 -0.0494 0.9506 5.9506
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 14.04940 13.99654 14.102
## fourtiesTRUE 0.09818 0.01187 0.184
##
## Residual standard error: 2.098 on 9683 degrees of freedom
## Multiple R-squared: 0.0005132, Adjusted R-squared: 0.00041
## F-statistic: 4.972 on 1 and 9683 DF, p-value: 0.02578
```

Is there a statistically significant difference between 30s and 40s average scores? If so, what direction is this difference? Does this challenge or confirm traditional age stereotypes (note: not all personality traits have one)?

**Answer**: Yes. People in their 40s have a statistically significantly higher average warmth score: the confidence interval for the difference (roughly 0.01 to 0.18) lies entirely above zero, although the effect is small.

Draw a bar plot of the variable `country` over the whole dataset. Take note of the number of countries with a very small number of responses.

```
ggplot(pf, aes(country)) +
  geom_bar()
```

Fitting a model with all of these countries is possible but not very useful. When some countries contribute only a handful of responses, their noisy estimates make the regions with plenty of data harder to identify and analyze. Fortunately there is a solution in the function `fct_lump`.
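To see what `fct_lump` (from the forcats package) does before putting it into a model, tabulate the lumped variable: the five most common countries keep their own levels and every other country is collapsed into "Other".

```
# keep the 5 most common countries; every other country becomes "Other"
table(fct_lump(pf$country, 5))
```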

Fit a regression model on the entire dataset by the country variable lumped into 5 categories.

```
model <- lm_basic(warmth ~ 1 + fct_lump(country, 5), data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1 + fct_lump(country, 5), data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1934 -1.0816 0.1697 1.0406 6.1697
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 14.19345 14.11485 14.272
## fct_lump(country, 5)CA -0.08515 -0.19517 0.025
## fct_lump(country, 5)GB -0.11188 -0.20819 -0.016
## fct_lump(country, 5)IN -0.23405 -0.33889 -0.129
## fct_lump(country, 5)US -0.02413 -0.10679 0.059
## fct_lump(country, 5)Other -0.36311 -0.44985 -0.276
##
## Residual standard error: 2.018 on 49037 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.004908, Adjusted R-squared: 0.004806
## F-statistic: 48.37 on 5 and 49037 DF, p-value: < 2.2e-16
```

Using the previous model, which country (not including “Other”) has the highest score for your trait? Which has the lowest?

**Answer**: The highest is Australia, the baseline country captured by the intercept. The lowest (excluding “Other”) is India, whose average is about 0.23 points below Australia’s.
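A direct cross-check (using warmth) is to compute the mean trait score within each lumped country, which summarizes the same information as the coefficients above:

```
# mean warmth by lumped country, sorted from highest to lowest
pf %>%
  group_by(country_lump = fct_lump(country, 5)) %>%
  summarize(avg_warmth = mean(warmth, na.rm = TRUE), n = n()) %>%
  arrange(desc(avg_warmth))
```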

Fit a model that predicts sensitivity as a function of the baseline score.

```
model <- lm_basic(sensitivity ~ 1 + baseline, data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = sensitivity ~ 1 + baseline, data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3533 -0.8905 0.0477 0.9890 6.4613
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 0.16645 0.03733 0.296
## baseline 0.93818 0.92830 0.948
##
## Residual standard error: 1.483 on 49058 degrees of freedom
## Multiple R-squared: 0.4139, Adjusted R-squared: 0.4139
## F-statistic: 3.465e+04 on 1 and 49058 DF, p-value: < 2.2e-16
```

Is the slope statistically significantly different from 1? Why is this an interesting question in the context of the data?

**Answer**: Yes. The 95% confidence interval for the slope runs from about 0.928 to 0.948, which does not contain 1, so the slope is significantly less than 1. This is an interesting question because the baseline is the average of all 16 traits, including sensitivity itself; a slope of exactly 1 would mean sensitivity rises one-for-one with the overall baseline.

Fit a regression model that uses both gender and the baseline score to predict your trait.

```
model <- lm_basic(sensitivity ~ 1 + baseline + gender, data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = sensitivity ~ 1 + baseline + gender, data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4486 -0.8698 0.0270 0.9565 6.3742
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 0.5419 0.4136 0.670
## baseline 0.9260 0.9163 0.936
## gendermale -0.5495 -0.5760 -0.523
##
## Residual standard error: 1.459 on 49057 degrees of freedom
## Multiple R-squared: 0.4331, Adjusted R-squared: 0.4331
## F-statistic: 1.874e+04 on 2 and 49057 DF, p-value: < 2.2e-16
```

Describe the slope for the gender term in words:

**Answer**: It gives the expected change in the sensitivity score for males relative to females, after accounting for the baseline score: here a decrease of about 0.55 points.

Add predictions from the previous model to `pf`.

`pf <- add_prediction(pf, model)`
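`add_prediction` attaches the model's fitted values to the dataset as a new column named `model_pred` (the name used in the plot below); a quick peek confirms the column is there:

```
# inspect the first few predictions alongside the inputs that produced them
select(pf, baseline, gender, sensitivity, model_pred) %>% head()
```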

Plot the baseline score as a function of `model_pred`, coloring the points based on the gender variable.

```
ggplot(pf, aes(model_pred, baseline)) +
  geom_point(aes(color = gender))
```

What do the predicted values look like?

**Answer**: The predictions fall along two parallel lines, one for each gender, just as we would expect from a model with a single baseline slope and a constant gender offset.