Once again we will load the survey response data from the Sixteen Personality Factor Questionnaire in order to practice our skills at statistical inference:

`pf <- read_csv("https://statsmaths.github.io/stat_data/cattell_16pf.csv")`

```
## Parsed with column specification:
## cols(
## .default = col_double(),
## age = col_integer(),
## gender = col_character(),
## country = col_character(),
## elapsed = col_integer()
## )
```

`## See spec(...) for full column specifications.`

The dataset uses the following fields:

- age: respondent’s age in years
- gender: respondent’s self-selected gender
- country: two-letter ISO country code for the respondent’s IP address
- elapsed: time taken to complete quiz in seconds
- warmth: personality score from 1-20
- reasoning: personality score from 1-20
- emotional_stability: personality score from 1-20
- dominance: personality score from 1-20
- liveliness: personality score from 1-20
- rule_consciousness: personality score from 1-20
- social_boldness: personality score from 1-20
- sensitivity: personality score from 1-20
- vigilance: personality score from 1-20
- abstractedness: personality score from 1-20
- privateness: personality score from 1-20
- apprehension: personality score from 1-20
- openness_to_change: personality score from 1-20
- self_reliance: personality score from 1-20
- perfectionism: personality score from 1-20
- tension: personality score from 1-20
- baseline: average score across all 16 personality traits

Use the `percentiles` function to find the 3rd and 97th percentiles of the elapsed time taken to complete the quiz.

`percentiles(pf$elapsed)`

```
## 0% 1% 2% 3% 4% 5%
## 5.00 178.59 315.18 369.00 394.00 413.00
## 6% 7% 8% 9% 10% 11%
## 427.00 441.00 452.00 463.00 473.00 482.00
## 12% 13% 14% 15% 16% 17%
## 491.00 499.00 508.00 516.00 523.00 530.00
## 18% 19% 20% 21% 22% 23%
## 537.00 544.00 551.00 557.00 565.00 571.00
## 24% 25% 26% 27% 28% 29%
## 578.00 584.00 591.00 598.00 604.52 611.00
## 30% 31% 32% 33% 34% 35%
## 617.00 624.00 630.00 637.00 643.00 650.00
## 36% 37% 38% 39% 40% 41%
## 656.00 663.00 669.00 676.00 683.00 690.00
## 42% 43% 44% 45% 46% 47%
## 697.00 704.00 712.00 719.00 726.00 733.00
## 48% 49% 50% 51% 52% 53%
## 741.00 747.00 754.00 762.00 770.00 779.00
## 54% 55% 56% 57% 58% 59%
## 786.00 795.00 804.00 812.00 822.00 831.00
## 60% 61% 62% 63% 64% 65%
## 841.00 852.00 862.00 873.00 883.00 895.00
## 66% 67% 68% 69% 70% 71%
## 907.00 919.00 933.00 947.00 962.00 977.00
## 72% 73% 74% 75% 76% 77%
## 992.00 1008.07 1025.00 1045.00 1066.00 1090.00
## 78% 79% 80% 81% 82% 83%
## 1114.00 1138.00 1168.00 1201.00 1231.00 1266.00
## 84% 85% 86% 87% 88% 89%
## 1306.00 1348.00 1393.74 1453.00 1520.00 1594.51
## 90% 91% 92% 93% 94% 95%
## 1682.00 1797.00 1924.00 2094.87 2314.00 2662.05
## 96% 97% 98% 99% 100%
## 3123.00 4107.46 6450.40 16610.85 8534589.00
```

**Answer**: 369.00 seconds and 4107.46 seconds
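If only the two cutoffs are needed, base R's `quantile()` can return them directly. This is a minimal sketch on a made-up vector; the `elapsed` values are hypothetical stand-ins for `pf$elapsed`, and it assumes the course's `percentiles` function reports the same kind of quantiles as `quantile()`.

```r
# Hypothetical completion times in seconds (stand-in for pf$elapsed).
elapsed <- c(5, 200, 400, 500, 600, 700, 800, 900, 1500, 9000)

# Request only the 3rd and 97th percentiles instead of all 101 values.
cutoffs <- quantile(elapsed, probs = c(0.03, 0.97))
print(cutoffs)
```

On the real data the same call would be `quantile(pf$elapsed, probs = c(0.03, 0.97))`.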

We want to construct a new variable called speed that tells us whether someone was a fast test taker (less than the 3rd percentile), a slow test taker (greater than the 97th percentile), or a normal test taker. To do so, I’ll use some code that we have not seen in this form before. If the 3rd percentile were 100s and the 97th were 1000s, the code would use those two values as its cutoffs; the block here shows it with the cutoffs from the previous question already filled in:

```
pf$speed <- "normal"
pf$speed[pf$elapsed < 369.00] <- "fast"
pf$speed[pf$elapsed > 4107.46] <- "slow"
```

Modify the code above to use the cutoffs you found in the previous question (and make sure you run it).
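To see what the assignment pattern does at the boundaries, here is a small sketch on a toy data frame. The `toy` data are hypothetical; the cutoffs are the ones found above. Each line overwrites only the rows where the logical condition is true, so values exactly equal to a cutoff stay "normal".

```r
# Hypothetical toy data with values below, at, and above the two cutoffs.
toy <- data.frame(elapsed = c(120, 369, 754, 4107.46, 5000))

# Start with every row labelled "normal", then overwrite the two tails.
toy$speed <- "normal"
toy$speed[toy$elapsed < 369.00] <- "fast"
toy$speed[toy$elapsed > 4107.46] <- "slow"

print(toy)
print(table(toy$speed))
```

Running `table(pf$speed)` on the real data is a quick way to check that about 3% of respondents land in each tail.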

Fit a linear model predicting the perfectionism variable as a function of the speed variable.

```
model <- lm_basic(perfectionism ~ 1 + speed, data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = perfectionism ~ 1 + speed, data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6112 -0.6112 0.3888 1.3888 10.0844
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 9.916 9.820 10.011
## speednormal 3.696 3.598 3.793
## speedslow 3.683 3.548 3.818
##
## Residual standard error: 1.871 on 49057 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.1018
## F-statistic: 2781 on 2 and 49057 DF, p-value: < 2.2e-16
```

How does the perfectionism score differ between the three groups? Are both slopes significant?

**Answer**: There is a significant difference between normal and fast as well as slow and fast. Both normal and slow have higher scores than the fast group.

In the last question it should have appeared that the perfectionism scores for the slow and normal groups were very similar. In the regression as given there is no way to test whether these two groups have a statistically significant difference because the slopes only relate to the baseline level (here, “fast”).

It is possible to change the baseline level of a categorical variable in R. To do so, use the function `fct_relevel`; its second argument gives the desired baseline level:

```
model <- lm_basic(perfectionism ~ 1 + fct_relevel(speed, "normal"), data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = perfectionism ~ 1 + fct_relevel(speed, "normal"),
## data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6112 -0.6112 0.3888 1.3888 10.0844
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 13.61124 13.59416 13.628
## fct_relevel(speed, "normal")fast -3.69559 -3.79275 -3.598
## fct_relevel(speed, "normal")slow -0.01273 -0.10983 0.084
##
## Residual standard error: 1.871 on 49057 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.1018
## F-statistic: 2781 on 2 and 49057 DF, p-value: < 2.2e-16
```

Using this table, is there a statistically significant difference between normal and slow users in their perfectionism score?

**Answer**: No, they appear very similar. The 95% confidence interval for the slow group's slope (-0.110 to 0.084) contains zero.
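The effect of changing the baseline can also be sketched with base R alone. This example swaps in `lm()`, `relevel()`, and `confint()` for the course helpers `lm_basic`, `fct_relevel`, and `reg_table`, and uses simulated data, so the numbers are illustrative only. The point is that the fitted model is the same; only the comparisons reported by the slopes change.

```r
set.seed(1)

# Simulated data (hypothetical): "fast" scores lower, the other two groups match.
toy <- data.frame(
  speed = rep(c("fast", "normal", "slow"), each = 20),
  score = c(rnorm(20, mean = 10), rnorm(20, mean = 13.6), rnorm(20, mean = 13.6))
)

# Default baseline: the first level alphabetically, here "fast".
fit_fast <- lm(score ~ 1 + speed, data = toy)

# relevel() moves "normal" to the front so it becomes the baseline.
toy$speed_n <- relevel(factor(toy$speed), ref = "normal")
fit_normal <- lm(score ~ 1 + speed_n, data = toy)

print(names(coef(fit_fast)))
print(names(coef(fit_normal)))

# The slope for "slow" now compares slow against normal directly.
print(confint(fit_normal, level = 0.95))
```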

Based on the result in the previous question, describe a plausible reason some people may finish quickly and a plausible reason some may take a long time. Hint: Converting the 97th percentile to hours may give you some ideas for the last part.

**Answer**: Likely the slow finishers did not just “take their time”, but rather forgot to submit the quiz, left it open on their browser, and submitted it days later. It may also just be a data error.
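The unit conversion suggested in the hint is simple arithmetic. The sketch below uses the 97th percentile and the maximum (the 100% value) from the percentile table above; the variable names are only illustrative.

```r
# Values taken from the percentile output above, in seconds.
p97_seconds <- 4107.46
max_seconds <- 8534589

# Convert to hours and to days.
p97_hours <- p97_seconds / 3600
max_days <- max_seconds / (60 * 60 * 24)

print(round(p97_hours, 2))
print(round(max_days, 1))
```

The 97th percentile is a little over an hour, while the maximum is on the order of months, which fits the explanation of a quiz left open and submitted much later.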

Take your personality trait from last class and select the one trait that most closely contrasts with your trait and one that most closely matches your trait. Fit a linear regression model that predicts your trait as a function of both of these as well as the baseline variable.

```
model <- lm_basic(warmth ~ 1 + baseline + dominance + sensitivity, data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1 + baseline + dominance + sensitivity,
## data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2078 -0.9110 0.1342 1.0305 5.9881
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 1.041272 0.908360 1.174
## baseline 1.093063 1.075893 1.110
## dominance -0.009831 -0.021179 0.002
## sensitivity -0.086008 -0.095105 -0.077
##
## Residual standard error: 1.517 on 49056 degrees of freedom
## Multiple R-squared: 0.4382, Adjusted R-squared: 0.4382
## F-statistic: 1.275e+04 on 3 and 49056 DF, p-value: < 2.2e-16
```

Interpret the significance and signs of the slopes in the previous model for the two traits that you selected. Do their signs match what you would have expected?

**Answer**: The dominance score is not significant and the sensitivity score is negative. I had expected sensitivity to be positively related to warmth.
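The significance check used in this answer amounts to asking whether each 95% confidence interval excludes zero. A minimal sketch, with the interval endpoints copied from the regression table above (the data frame and column names are just for illustration):

```r
# Interval endpoints copied from the reg_table output above.
ci <- data.frame(
  term  = c("dominance", "sensitivity"),
  lower = c(-0.021179, -0.095105),
  upper = c(0.002, -0.077)
)

# A slope is significant at this level when its interval does not contain zero.
ci$excludes_zero <- ci$lower > 0 | ci$upper < 0
print(ci)
```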

Take the previous model and add the variables `speed`, `gender`, and `country`, the latter lumped into 5 categories.

```
model <- lm_basic(warmth ~ 1 + baseline + dominance + sensitivity + gender +
speed + fct_lump(country, 5), data = pf)
reg_table(model, level = 0.95)
```

```
##
## Call:
## lm_basic(formula = warmth ~ 1 + baseline + dominance + sensitivity +
## gender + speed + fct_lump(country, 5), data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1401 -0.8749 0.1334 1.0036 6.3684
##
## Coefficients:
## Estimate 2.5 % 97.5 %
## (Intercept) 1.054501 0.907847 1.201
## baseline 1.050171 1.032617 1.068
## dominance -0.003769 -0.015028 0.007
## sensitivity -0.097655 -0.106813 -0.088
## gendermale -0.278292 -0.306115 -0.250
## speednormal 0.833546 0.748349 0.919
## speedslow 0.808494 0.694915 0.922
## fct_lump(country, 5)CA 0.015298 -0.066418 0.097
## fct_lump(country, 5)GB 0.014884 -0.056745 0.087
## fct_lump(country, 5)IN -0.155583 -0.233710 -0.077
## fct_lump(country, 5)US -0.009919 -0.071326 0.051
## fct_lump(country, 5)Other -0.314191 -0.378623 -0.250
##
## Residual standard error: 1.498 on 49031 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.4518, Adjusted R-squared: 0.4516
## F-statistic: 3673 on 11 and 49031 DF, p-value: < 2.2e-16
```

Does this change the slopes for your two traits much?

**Answer**: The numbers are slightly different, but the significances are the same. Dominance is still not significant, and sensitivity is still significant and negatively related to warmth.
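The `fct_lump(country, 5)` term keeps the most common countries as their own categories and groups the remainder as "Other". A rough base R sketch of that idea follows; `lump_top` is a hypothetical helper written for illustration, not the forcats implementation, and it differs in details such as level ordering and the handling of ties and missing values.

```r
# Hypothetical helper: keep the n most frequent values, label the rest "Other".
lump_top <- function(x, n) {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(x %in% keep, x, "Other"))
}

# Made-up country codes: US x4, GB x3, IN x2, FR x1, DE x1.
country <- c("US", "US", "US", "US", "GB", "GB", "GB", "IN", "IN", "FR", "DE")

lumped <- lump_top(country, 3)
print(table(lumped))
```

Lumping keeps the number of country slopes small, so each remaining category has enough respondents for its confidence interval to be informative.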

Summarize the previous model (specifically the relationship between the three traits) in words.

**Answer**: After controlling for the baseline score, speed, gender, and country, sensitivity is negatively correlated with warmth and there is no statistically significant evidence that dominance is related to warmth.