The primary objective of this project is to answer the question: **What factors are useful for predicting the complexity of Shakespeare’s plays?** William Shakespeare’s plays are traditionally divided into the three genres of **comedy**, **tragedy**, and **history plays**. Each of the genres has some very distinct features. Comedies generally have lighter themes with simple love plots and happy endings. Tragedies are very serious and depict the tragic fate of numerous characters. History plays are also rather complex, as they need to accurately depict the zeitgeist.

In the context of this project, I operationalize *complexity* as the number of clusters. Clustering is a grouping of data objects (here, characters in Shakespeare’s plays) into similarity groups according to a distance measure. The provided data includes two distance measures: time and speech. I compare the effects of choosing either of the two measures. I believe that the number of clusters is an appropriate measure of the complexity of Shakespeare’s plays because, as the number of distinct groups among the characters increases, we can assume a greater complexity of the story line (an increase in the number of plots).
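To make the idea of counting clusters concrete, here is a minimal sketch in R. It assumes hierarchical clustering with average linkage and an arbitrary cut height; the source does not state which algorithm or cut-off produced the cluster counts in the data, and the five characters and their distances are hypothetical toy values.

```r
# Hypothetical toy example: pairwise distances between five characters.
# Small values mean the characters are "close" (e.g. often on stage together).
chars <- c("A", "B", "C", "D", "E")
d <- matrix(
  c(0, 1, 9, 9, 9,
    1, 0, 9, 9, 9,
    9, 9, 0, 2, 9,
    9, 9, 2, 0, 9,
    9, 9, 9, 9, 0),
  nrow = 5, byrow = TRUE, dimnames = list(chars, chars)
)

# Assumption: average-linkage hierarchical clustering (base R, stats package).
hc <- hclust(as.dist(d), method = "average")

# Assumption: cut the tree at height 5; characters merged below that
# height end up in the same group.
groups <- cutree(hc, h = 5)

# The complexity measure is simply the number of distinct groups.
n_clusters <- length(unique(groups))
n_clusters  # 3: {A, B}, {C, D}, {E}
```

A lower cut height would yield more, smaller groups, so the cluster count depends on the chosen cut-off as well as on the distance measure.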

Consequently, I will use quantitative data available on Shakespeare’s plays to argue the following two points: (1) The number of characters can be used as a predictor of the complexity of a play: the higher the number of characters, the greater the complexity. (2) The genre of a play is useful in predicting its complexity: tragedies are the most complex, followed by history plays, followed by comedies.

id | title | genre | characters | clusters_time | clusters_speech |
---|---|---|---|---|---|
1 | a_midsummer_nights_dream | comedy | 25 | 3 | 3 |
2 | alls_well_that_ends_well | comedy | 23 | 3 | 3 |
3 | anthony_and_cleopatra | tragedy | 54 | 7 | 7 |
4 | as_you_like_it | comedy | 24 | 7 | 7 |
5 | coriolanus | tragedy | 58 | 10 | 10 |
6 | cymbeline | tragedy | 37 | 9 | 9 |
7 | hamlet_prince_of_denmark | tragedy | 35 | 5 | 5 |
8 | henry_iv_part_1 | history | 36 | 4 | 4 |
9 | henry_iv_part_2 | history | 46 | 6 | 6 |
10 | henry_v | history | 42 | 6 | 6 |
11 | henry_vi_part_1 | history | 55 | 5 | 5 |
12 | henry_vi_part_2 | history | 62 | 10 | 10 |
13 | henry_vi_part_3 | history | 43 | 5 | 5 |
14 | henry_viii | history | 43 | 5 | 5 |
15 | julius_caesar | tragedy | 45 | 5 | 5 |
16 | king_lear | tragedy | 26 | 7 | 7 |
17 | loves_labours_lost | comedy | 18 | 2 | 2 |
18 | macbeth | tragedy | 39 | 6 | 6 |
19 | measure_for_measure | comedy | 24 | 6 | 5 |
20 | much_ado_about_nothing | comedy | 23 | 2 | 2 |
21 | othello_the_moor_of_venice | tragedy | 24 | 2 | 2 |
22 | pericles_prince_of_tyre | history | 43 | 8 | 8 |
23 | richard_ii | history | 35 | 10 | 10 |
24 | richard_iii | history | 56 | 3 | 3 |
25 | romeo_and_juliet | tragedy | 35 | 3 | 3 |
26 | the_comedy_of_errors | comedy | 18 | 3 | 3 |
27 | the_life_and_death_of_king_john | history | 27 | 3 | 3 |
28 | the_merchant_of_venice | comedy | 21 | 3 | 3 |
29 | the_merry_wives_of_windsor | comedy | 23 | 2 | 2 |
30 | the_taming_of_the_shrew | comedy | 33 | 4 | 5 |
31 | the_tempest | comedy | 18 | 3 | 3 |
32 | the_two_gentlemen_of_verona | comedy | 18 | 5 | 5 |
33 | the_winters_tale | comedy | 32 | 3 | 3 |
34 | timon_of_athens | tragedy | 57 | 8 | 10 |
35 | titus_andronicus | tragedy | 26 | 3 | 3 |
36 | troilus_and_cressida | tragedy | 25 | 6 | 6 |
37 | twelfth_night_or_what_you_will | comedy | 18 | 4 | 4 |

The table above summarizes the quantitative data available on Shakespeare’s plays. The dataset covers 37 plays: 14 comedies, 12 tragedies, and 11 history plays. For 34 of the 37 plays, the number of clusters by time equals the number of clusters by speech (given that no particular cut-off is set). The three exceptions are “Measure for Measure” (six clusters by time, five by speech), “The Taming of the Shrew” (four by time, five by speech), and the tragedy “Timon of Athens” (eight by time, ten by speech). The table also includes the number of characters in each play, and I elaborate on these figures further below.

The dotted lines on the plot indicate the mean number of characters for each genre. On average, there are 22.7 characters in a comedy, 38.4 in a tragedy, and 44.4 in a history play. This is consistent with the expectation from the introduction that comedies are simpler than tragedies and history plays.
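The summary figures quoted above can be checked directly against the table. The sketch below is not the original analysis code: it transcribes the table into a data frame named `plays` (a hypothetical name; the report's own object is `data_p3`, whose loading step is not shown) and recomputes the genre counts, the plays whose two cluster counts differ, and the mean number of characters per genre.

```r
# Data transcribed from the table above, in order of id (1 to 37).
# Genre codes: c = comedy, t = tragedy, h = history.
genre_codes <- paste0(
  "cctcttt",  # ids 1-7
  "hhhhhhh",  # ids 8-14
  "ttctcct",  # ids 15-21
  "hhhtchc",  # ids 22-28
  "ccccc",    # ids 29-33
  "tttc"      # ids 34-37
)
genre_names <- c(c = "comedy", t = "tragedy", h = "history")

plays <- data.frame(
  id = 1:37,
  genre = unname(genre_names[strsplit(genre_codes, "")[[1]]]),
  characters = c(25, 23, 54, 24, 58, 37, 35, 36, 46, 42, 55, 62, 43, 43, 45,
                 26, 18, 39, 24, 23, 24, 43, 35, 56, 35, 18, 27, 21, 23, 33,
                 18, 18, 32, 57, 26, 25, 18),
  clusters_time = c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5, 5, 5,
                    7, 2, 6, 6, 2, 2, 8, 10, 3, 3, 3, 3, 3, 2, 4,
                    3, 5, 3, 8, 3, 6, 4),
  clusters_speech = c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5, 5, 5,
                      7, 2, 6, 5, 2, 2, 8, 10, 3, 3, 3, 3, 3, 2, 5,
                      3, 5, 3, 10, 3, 6, 4)
)

# Number of plays per genre: 14 comedies, 11 history plays, 12 tragedies.
genre_counts <- table(plays$genre)

# Ids of plays whose cluster counts differ between the two measures: 19, 30, 34.
mismatch_ids <- plays$id[plays$clusters_time != plays$clusters_speech]

# Mean number of characters per genre, rounded to one decimal:
# comedy 22.7, history 44.4, tragedy 38.4.
mean_chars <- round(tapply(plays$characters, plays$genre, mean), 1)

print(genre_counts)
print(mismatch_ids)
print(mean_chars)
```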

The purpose of the first model is to provide evidence for the first thesis statement. I fit a linear model in which the number of clusters (by time) for a play is a function of the number of characters in that play.

```
model <- lm(clusters_time ~ characters, data = data_p3)
summary(model)
```

```
##
## Call:
## lm(formula = clusters_time ~ characters, data = data_p3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3124 -1.2116 -0.3208 1.6792 4.8935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43001 0.91019 1.571 0.12516
## characters 0.10504 0.02484 4.228 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.969 on 35 degrees of freedom
## Multiple R-squared: 0.3381, Adjusted R-squared: 0.3192
## F-statistic: 17.88 on 1 and 35 DF, p-value: 0.0001605
```

`model`

```
##
## Call:
## lm(formula = clusters_time ~ characters, data = data_p3)
##
## Coefficients:
## (Intercept) characters
## 1.430 0.105
```

I also devise a similar model for clusters by speech.

```
model <- lm(clusters_speech ~ characters, data = data_p3)
summary(model)
```

```
##
## Call:
## lm(formula = clusters_speech ~ characters, data = data_p3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5553 -1.2574 -0.3279 1.7623 4.8329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18685 0.91640 1.295 0.204
## characters 0.11372 0.02501 4.547 6.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.982 on 35 degrees of freedom
## Multiple R-squared: 0.3713, Adjusted R-squared: 0.3534
## F-statistic: 20.67 on 1 and 35 DF, p-value: 6.257e-05
```

`model`

```
##
## Call:
## lm(formula = clusters_speech ~ characters, data = data_p3)
##
## Coefficients:
## (Intercept) characters
## 1.1868 0.1137
```

The first model supports the first thesis statement: the higher the number of characters in a play, the greater its complexity. This holds for both clusters by time and clusters by speech, as shown by the positive coefficients of 0.105 and 0.1137, respectively; each additional character is associated with roughly 0.1 additional clusters. It should also be borne in mind that both models show very low p-values (about 0.00016 and 0.00006), indicating that the number of characters is a statistically significant predictor of the complexity of a play. At the same time, the R-squared values of 0.34 and 0.37 show that the number of characters explains only about a third of the variation in the number of clusters.
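As a worked illustration of what the fitted line implies, the sketch below plugs a few character counts into the clusters-by-time equation. The coefficients are copied from the summary output above; the function name `predict_clusters_time` and the example character counts are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the clusters_time ~ characters summary above.
intercept <- 1.43001
slope     <- 0.10504

# Hypothetical helper: fitted number of clusters (by time) for a play
# with n_characters characters.
predict_clusters_time <- function(n_characters) {
  intercept + slope * n_characters
}

# Fitted values for plays with 20, 40 and 60 characters.
fitted <- predict_clusters_time(c(20, 40, 60))
print(fitted)  # 3.53081 5.63161 7.73241
```

Under this model, doubling the cast from 20 to 40 characters raises the fitted value by about two clusters; the residual standard error of about 1.97 means individual plays can deviate from the line by a similar amount.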

The purpose of the second model is to provide evidence for the second thesis statement. I fit a model with a single predictor, in which the number of clusters (by time) for a play is a function of its genre. “Comedy” serves as the reference level, so the coefficients for history plays and tragedies measure the difference from the comedy mean.

```
model <- lm(clusters_time ~ genre, data = data_p3)
summary(model)
```

```
##
## Call:
## lm(formula = clusters_time ~ genre, data = data_p3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9167 -0.9167 -0.5714 1.0833 4.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5714 0.5748 6.214 4.55e-07 ***
## genrehistory 2.3377 0.8665 2.698 0.01079 *
## genretragedy 2.3452 0.8460 2.772 0.00897 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.151 on 34 degrees of freedom
## Multiple R-squared: 0.2328, Adjusted R-squared: 0.1877
## F-statistic: 5.159 on 2 and 34 DF, p-value: 0.01105
```

`model`

```
##
## Call:
## lm(formula = clusters_time ~ genre, data = data_p3)
##
## Coefficients:
## (Intercept) genrehistory genretragedy
## 3.571 2.338 2.345
```

I also devise a similar model for clusters by speech.

```
model <- lm(clusters_speech ~ genre, data = data_p3)
summary(model)
```

```
##
## Call:
## lm(formula = clusters_speech ~ genre, data = data_p3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0833 -1.0833 -0.5714 1.4286 4.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5714 0.5928 6.025 7.99e-07 ***
## genrehistory 2.3377 0.8936 2.616 0.01318 *
## genretragedy 2.5119 0.8725 2.879 0.00686 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.218 on 34 degrees of freedom
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.1905
## F-statistic: 5.235 on 2 and 34 DF, p-value: 0.01042
```

`model`

```
##
## Call:
## lm(formula = clusters_speech ~ genre, data = data_p3)
##
## Coefficients:
## (Intercept) genrehistory genretragedy
## 3.571 2.338 2.512
```

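The analysis of the model shows that, as stated in the second thesis, genre is related to the complexity of a play. The point estimates order the genres as predicted: tragedies are the most complex, followed by history plays, followed by comedies. However, the coefficients for tragedies and history plays are nearly identical (2.345 against 2.338 for clusters by time, 2.512 against 2.338 for clusters by speech), and the model only compares each of them with comedies, so the ordering of these two genres rests on a very small difference. The p-values here are also higher than in the case of the number of characters: the difference from comedies is significant at the 5% level for history plays (p ≈ 0.011 and 0.013) and at the 1% level for tragedies (p ≈ 0.009 and 0.007).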
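Because genre enters the model as a categorical predictor with comedy as the reference level, the intercept is the mean cluster count for comedies and each coefficient is an offset from it. The sketch below converts the coefficients printed above into per-genre means; the helper name `genre_means` is hypothetical, and the numbers are copied from the output rather than refitted.

```r
# Coefficients copied from the two genre models above.
coef_time   <- c(intercept = 3.5714, history = 2.3377, tragedy = 2.3452)
coef_speech <- c(intercept = 3.5714, history = 2.3377, tragedy = 2.5119)

# Hypothetical helper: turn reference-level coefficients into group means.
genre_means <- function(b) {
  c(comedy  = b[["intercept"]],
    history = b[["intercept"]] + b[["history"]],
    tragedy = b[["intercept"]] + b[["tragedy"]])
}

means_time   <- genre_means(coef_time)
means_speech <- genre_means(coef_speech)

print(round(means_time, 2))    # comedy 3.57, history 5.91, tragedy 5.92
print(round(means_speech, 2))  # comedy 3.57, history 5.91, tragedy 6.08
```

Seen as means, comedies average about 3.6 clusters while tragedies and history plays both average about 6, which shows how small the gap between the latter two genres is.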

The graphs below serve as an aid to visualize the results of the modeling analysis above. I plotted the number of clusters by time and by speech in every Shakespearean play in the dataset.