Sixteen Personality Factors: Pick Your Trait

Over the next two classes we are going to be looking at survey response data from the Sixteen Personality Factor Questionnaire in order to practice our skills at statistical inference:

pf <- read_csv("https://statsmaths.github.io/stat_data/cattell_16pf.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   age = col_integer(),
##   gender = col_character(),
##   country = col_character(),
##   elapsed = col_integer()
## )
## See spec(...) for full column specifications.

The dataset use the following fields:

To start with, select a particular trait that you will use for the first bank of questions. I suggest picking something that popped out when you took the test. You can pick any of the 16 other than sensitivity.

Produce a bar plot of the personality scores for your trait.

ggplot(pf, aes(warmth)) +
  geom_bar()

Describe the distribution. Why does a bar plot work here even though the variable is numeric?

Answer: Most values are between 12 and 16, with the most common number being 15.

Produce a confidence interval for the mean of your trait.

model <- lm_basic(warmth ~ 1, data = pf)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = warmth ~ 1, data = pf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0639  -1.0639  -0.0639   0.9361   5.9361 
## 
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    14.06 14.05  14.08
## 
## Residual standard error: 2.023 on 49059 degrees of freedom

Now, produce a dataset that consists only of responses from the country of Hungary. The 2-letter country code for Hungary is “HU”. Produce a confidence interval for your trait on the Hungarian dataset.

pf_hu <- filter(pf, country == "HU")
model <- lm_basic(warmth ~ 1, data = pf_hu)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = warmth ~ 1, data = pf_hu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1273 -1.1273 -0.1273  1.8727  3.8727 
## 
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    13.13 12.55  13.71
## 
## Residual standard error: 2.152 on 54 degrees of freedom

What do you notice about this confidence interval compared to the original one? Can you explain why this is the case?

Answer: It is much wider because there is less data.

Find a 95% confidence interval for the difference between the average male and female value for your personality trait on the Hungarian data.

model <- lm_basic(warmth ~ 1 + gender, data = pf_hu)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = warmth ~ 1 + gender, data = pf_hu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.000 -1.219  0.000  1.781  3.781 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)  13.2188 12.4495 13.988
## gendermale   -0.2188 -1.4083  0.971
## 
## Residual standard error: 2.17 on 53 degrees of freedom
## Multiple R-squared:  0.00256,    Adjusted R-squared:  -0.01626 
## F-statistic: 0.136 on 1 and 53 DF,  p-value: 0.7137

Answer: The difference on the Hungarian dataset is between -1.4083 and 0.971.

Is there a statistically significant difference between men and women’s average score? If so, what direction is this difference? Does this challenge or confirm traditional gender stereotypes (note: not all personality traits have one)?

Answer: No, there is no significant difference.

Construct a new dataset that only has ages from 30-49 (from the original pf, not just the Hungarian subset). Hint: you can use the filter function twice.

pf_ages <- filter(pf, age >= 30, age <= 49)

On the dataset of people ages 30 to 49, create a variable called fourties if age is greater than or equal to 40.

pf_ages$fourties <- pf_ages$age >= 40

Fit a regression on with your personality trait as a response to test for the difference between its mean for people in their 30s versus people in their 40s. Compute a confidence interval for this difference.

model <- lm_basic(warmth ~ fourties, data = pf_ages)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = warmth ~ fourties, data = pf_ages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1476  -1.0494  -0.0494   0.9506   5.9506 
## 
## Coefficients:
##              Estimate    2.5 % 97.5 %
## (Intercept)  14.04940 13.99654 14.102
## fourtiesTRUE  0.09818  0.01187  0.184
## 
## Residual standard error: 2.098 on 9683 degrees of freedom
## Multiple R-squared:  0.0005132,  Adjusted R-squared:  0.00041 
## F-statistic: 4.972 on 1 and 9683 DF,  p-value: 0.02578

Is there a statistically significant difference between 30s and 40s average scores? If so, what direction is this difference? Does this challenge or confirm traditional age stereotypes (note: not all personality traits have one)?

Answer: Yes, people in the 40s have a statistically significant higher warmth score.

Draw a bar plot of the variable country over the whole dataset. Take note of the number of countries with a very small number of responses.

ggplot(pf, aes(country)) +
  geom_bar()

Fitting a model with all of these countries is possible but not very useful. When we only have a few responses from some places it makes those few regions with a lot of data harding to identify and analyze. Fortunately there is a solution in the function fct_lump.

Fit a regression model on the entire dataset by the country variable lumped into 5 categories.

model <- lm_basic(warmth ~ 1 + fct_lump(country, 5), data = pf)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = warmth ~ 1 + fct_lump(country, 5), data = pf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1934  -1.0816   0.1697   1.0406   6.1697 
## 
## Coefficients:
##                           Estimate    2.5 % 97.5 %
## (Intercept)               14.19345 14.11485 14.272
## fct_lump(country, 5)CA    -0.08515 -0.19517  0.025
## fct_lump(country, 5)GB    -0.11188 -0.20819 -0.016
## fct_lump(country, 5)IN    -0.23405 -0.33889 -0.129
## fct_lump(country, 5)US    -0.02413 -0.10679  0.059
## fct_lump(country, 5)Other -0.36311 -0.44985 -0.276
## 
## Residual standard error: 2.018 on 49037 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.004908,   Adjusted R-squared:  0.004806 
## F-statistic: 48.37 on 5 and 49037 DF,  p-value: < 2.2e-16

Using the previous model, which country (not including “Other”) has the highest score for your trait? Which has the lowest?

Answer: Australia, the baseline country.

Fitting Sensitivity Scores

Fit a model that predicts sensitivity as a function of the baseline score.

model <- lm_basic(sensitivity ~ 1 + baseline, data = pf)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = sensitivity ~ 1 + baseline, data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3533 -0.8905  0.0477  0.9890  6.4613 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)  0.16645 0.03733  0.296
## baseline     0.93818 0.92830  0.948
## 
## Residual standard error: 1.483 on 49058 degrees of freedom
## Multiple R-squared:  0.4139, Adjusted R-squared:  0.4139 
## F-statistic: 3.465e+04 on 1 and 49058 DF,  p-value: < 2.2e-16

Is the slope statistically significantly different from 1? Why is this an interesting question in the context of the data?

Answer: Yes, it is significantly different from 1. The range is from 0.92830 to 0.948.

Fit a regression model that uses both gender and the baseline score to predict your trait.

model <- lm_basic(sensitivity ~ 1 + baseline + gender, data = pf)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = sensitivity ~ 1 + baseline + gender, data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4486 -0.8698  0.0270  0.9565  6.3742 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)   0.5419  0.4136  0.670
## baseline      0.9260  0.9163  0.936
## gendermale   -0.5495 -0.5760 -0.523
## 
## Residual standard error: 1.459 on 49057 degrees of freedom
## Multiple R-squared:  0.4331, Adjusted R-squared:  0.4331 
## F-statistic: 1.874e+04 on 2 and 49057 DF,  p-value: < 2.2e-16

Describe the slope for the gender term in words:

Answer: Its gives the extra amount of the sensitivity score for males (here, a decrease because of the negative sign) after accounting for the baseline.

Add predictions from the previous model to pf.

pf <- add_prediction(pf, model)

Plot the baseline score as a function of model_pred, coloring the points base on the gender variable.

ggplot(pf, aes(model_pred, baseline)) +
  geom_point(aes(color = gender))

What do the predicted values look like?

Answer: These are two parallel lines, just as we would expect given the model.