Introduction

The primary objective of this project is to answer the question: What factors are useful for predicting the complexity of Shakespeare’s plays? William Shakespeare’s plays are traditionally divided into three genres: comedies, tragedies, and history plays. Each genre has distinct features. Comedies generally have lighter themes, with simple love plots and happy endings. Tragedies are serious and depict the tragic fate of numerous characters. History plays are also rather complex, as they need to depict the zeitgeist accurately.

In the context of this project, I operationalize complexity as the number of clusters. Clustering is a classification of data objects - here, characters in Shakespeare’s plays - into similarity groups according to a distance measure. The provided data includes two distance measures, time and speech, and I compare the effects of choosing either of them. I believe that the number of clusters is an appropriate measure of the complexity of Shakespeare’s plays: as the number of distinct groups among the characters increases, we can assume a greater complexity of the story line (an increase in the number of plots).
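To make the idea concrete, the following is a minimal sketch of counting clusters from a distance matrix in R. The provided data do not state which clustering algorithm or cut-off produced the cluster counts, so the average-linkage method, the cut height of 5, and the five characters with their distances are all assumptions invented for illustration.

```r
# Toy symmetric distance matrix for five hypothetical characters.
# The values are invented; they are not taken from any play.
labels <- c("A", "B", "C", "D", "E")
d <- matrix(c(
  0, 1, 9, 8, 9,
  1, 0, 8, 9, 9,
  9, 8, 0, 1, 7,
  8, 9, 1, 0, 7,
  9, 9, 7, 7, 0
), nrow = 5, byrow = TRUE, dimnames = list(labels, labels))

# Agglomerative clustering with average linkage (an assumed choice).
tree <- hclust(as.dist(d), method = "average")

# Cut the tree at an assumed height; characters merged below it share a cluster.
groups <- cutree(tree, h = 5)

# The "complexity" of this toy play is the number of distinct groups.
n_clusters <- length(unique(groups))
print(groups)
print(n_clusters)
```

A different linkage method or cut height can change the number of groups, which is why the choice of distance measure and cut-off matters for the complexity score.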

Consequently, I will use quantitative data available on Shakespeare’s plays to argue the following two points: (1) The number of characters can be used as a predictor of the complexity of a play. The higher the number of characters in a play, the greater its complexity. (2) The genre of a play is useful in predicting its complexity. Tragedies are the most complex, followed by history plays, followed by comedies.

Some summary information about Shakespeare’s plays
id title genre characters clusters_time clusters_speech
1 a_midsummer_nights_dream comedy 25 3 3
2 alls_well_that_ends_well comedy 23 3 3
3 anthony_and_cleopatra tragedy 54 7 7
4 as_you_like_it comedy 24 7 7
5 coriolanus tragedy 58 10 10
6 cymbeline tragedy 37 9 9
7 hamlet_prince_of_denmark tragedy 35 5 5
8 henry_iv_part_1 history 36 4 4
9 henry_iv_part_2 history 46 6 6
10 henry_v history 42 6 6
11 henry_vi_part_1 history 55 5 5
12 henry_vi_part_2 history 62 10 10
13 henry_vi_part_3 history 43 5 5
14 henry_viii history 43 5 5
15 julius_caesar tragedy 45 5 5
16 king_lear tragedy 26 7 7
17 loves_labours_lost comedy 18 2 2
18 macbeth tragedy 39 6 6
19 measure_for_measure comedy 24 6 5
20 much_ado_about_nothing comedy 23 2 2
21 othello_the_moor_of_venice tragedy 24 2 2
22 pericles_prince_of_tyre history 43 8 8
23 richard_ii history 35 10 10
24 richard_iii history 56 3 3
25 romeo_and_juliet tragedy 35 3 3
26 the_comedy_of_errors comedy 18 3 3
27 the_life_and_death_of_king_john history 27 3 3
28 the_merchant_of_venice comedy 21 3 3
29 the_merry_wives_of_windsor comedy 23 2 2
30 the_taming_of_the_shrew comedy 33 4 5
31 the_tempest comedy 18 3 3
32 the_two_gentlemen_of_verona comedy 18 5 5
33 the_winters_tale comedy 32 3 3
34 timon_of_athens tragedy 57 8 10
35 titus_andronicus tragedy 26 3 3
36 troilus_and_cressida tragedy 25 6 6
37 twelfth_night_or_what_you_will comedy 18 4 4

The table above provides a summary of some quantitative data available on Shakespeare’s plays. The dataset contains a total of 37 plays. Among them, there are 14 comedies, 12 tragedies, and 11 history plays. An interesting finding which can be seen in the table is that for 34 out of 37 plays in the dataset, the number of clusters by time is the same as the number of clusters by speech (given that no particular cut-off is set). The exceptions are the comedies “Measure for Measure” (six clusters by time, five by speech) and “The Taming of the Shrew” (four by time, five by speech), and the tragedy “Timon of Athens” (eight by time, ten by speech). The table also includes data on the number of characters in each play, and I elaborate on these figures further below.

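The dotted lines on the plot indicate the mean number of characters for each genre. There are on average 22.7 characters in a comedy, 38.4 characters in a tragedy, and 44.4 characters in a history play. This is partly consistent with what I assumed in the introduction: comedies, which I expect to be the simplest, have the fewest characters, although history plays rather than tragedies have the most.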
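The summary figures in this section can be recomputed from the table. The sketch below retypes the table columns by hand; the data frame name `plays` and the one-letter genre codes are conveniences introduced here and are not part of the original `data_p3` object.

```r
# One letter per play, in table order (c = comedy, h = history, t = tragedy).
codes <- strsplit(
  paste0("cctcttthhh", "hhhhttctcc", "thhhtchccc", "ccctttc"), ""
)[[1]]

plays <- data.frame(
  id = 1:37,
  genre = unname(c(c = "comedy", h = "history", t = "tragedy")[codes]),
  characters = c(25, 23, 54, 24, 58, 37, 35, 36, 46, 42, 55, 62, 43,
                 43, 45, 26, 18, 39, 24, 23, 24, 43, 35, 56, 35, 18,
                 27, 21, 23, 33, 18, 18, 32, 57, 26, 25, 18),
  clusters_time = c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5,
                    5, 5, 7, 2, 6, 6, 2, 2, 8, 10, 3, 3, 3,
                    3, 3, 2, 4, 3, 5, 3, 8, 3, 6, 4),
  clusters_speech = c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5,
                      5, 5, 7, 2, 6, 5, 2, 2, 8, 10, 3, 3, 3,
                      3, 3, 2, 5, 3, 5, 3, 10, 3, 6, 4)
)

# Number of plays per genre.
genre_counts <- table(plays$genre)

# Plays whose cluster count depends on the distance measure.
differing <- subset(plays, clusters_time != clusters_speech)

# Mean number of characters per genre, rounded to one decimal place.
mean_characters <- round(tapply(plays$characters, plays$genre, mean), 1)

print(genre_counts)
print(differing)
print(mean_characters)
```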

Model 1: Clusters and number of characters

The purpose of the first model is to provide evidence for the first thesis statement. I fit a linear model in which the number of clusters (by time) for a play is a function of the number of characters in that play.

model <- lm(clusters_time ~ characters, data = data_p3)
summary(model)
##
## Call:
## lm(formula = clusters_time ~ characters, data = data_p3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3124 -1.2116 -0.3208  1.6792  4.8935
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.43001    0.91019   1.571  0.12516
## characters   0.10504    0.02484   4.228  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.969 on 35 degrees of freedom
## Multiple R-squared:  0.3381, Adjusted R-squared:  0.3192
## F-statistic: 17.88 on 1 and 35 DF,  p-value: 0.0001605
model
##
## Call:
## lm(formula = clusters_time ~ characters, data = data_p3)
##
## Coefficients:
## (Intercept)   characters
##       1.430        0.105

I also devise a similar model for clusters by speech.

model <- lm(clusters_speech ~ characters, data = data_p3)
summary(model)
##
## Call:
## lm(formula = clusters_speech ~ characters, data = data_p3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5553 -1.2574 -0.3279  1.7623  4.8329
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.18685    0.91640   1.295    0.204
## characters   0.11372    0.02501   4.547 6.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.982 on 35 degrees of freedom
## Multiple R-squared:  0.3713, Adjusted R-squared:  0.3534
## F-statistic: 20.67 on 1 and 35 DF,  p-value: 6.257e-05
model
##
## Call:
## lm(formula = clusters_speech ~ characters, data = data_p3)
##
## Coefficients:
## (Intercept)   characters
##      1.1868       0.1137

The first model supports the first thesis statement: the higher the number of characters in a play, the greater its complexity. This holds for both clusters by time and clusters by speech, as shown by the positive coefficients, 0.105 and 0.1137 respectively: each additional character is associated with roughly 0.1 more clusters. It should also be borne in mind that the p-values for these coefficients are very low (0.00016 and 6.26e-05), indicating that the number of characters is a highly significant predictor of complexity, even though it explains only about a third of the variance (R-squared of 0.34 and 0.37).
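One way to read the slope is through a confidence interval and a pair of predictions. The sketch below refits the clusters-by-time model from values retyped from the summary table; the two hypothetical plays with 20 and 50 characters are illustrative inputs, not plays from the dataset.

```r
# Columns retyped from the summary table, in table order.
characters <- c(25, 23, 54, 24, 58, 37, 35, 36, 46, 42, 55, 62, 43,
                43, 45, 26, 18, 39, 24, 23, 24, 43, 35, 56, 35, 18,
                27, 21, 23, 33, 18, 18, 32, 57, 26, 25, 18)
clusters_time <- c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5,
                   5, 5, 7, 2, 6, 6, 2, 2, 8, 10, 3, 3, 3,
                   3, 3, 2, 4, 3, 5, 3, 8, 3, 6, 4)

fit <- lm(clusters_time ~ characters)

# 95% confidence interval for the slope; an interval that excludes zero
# agrees with the small p-value reported by summary().
slope_ci <- confint(fit, "characters", level = 0.95)

# Predicted cluster counts for two hypothetical plays.
new_plays <- data.frame(characters = c(20, 50))
predicted <- predict(fit, newdata = new_plays)

print(coef(fit))
print(slope_ci)
print(predicted)
```

The difference between the two predictions is simply 30 times the slope, which is the practical meaning of the coefficient.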

Model 2: Clusters and genre

The purpose of the second model is to provide evidence for the second thesis statement. I fit a model with a single predictor, in which the number of clusters (by time) for a play is a function of its genre. “Comedy” serves as the reference level, so the intercept is the mean number of clusters for comedies and each genre coefficient is the difference from that mean.

model <- lm(clusters_time ~ genre, data = data_p3)
summary(model)
##
## Call:
## lm(formula = clusters_time ~ genre, data = data_p3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9167 -0.9167 -0.5714  1.0833  4.0909
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.5714     0.5748   6.214 4.55e-07 ***
## genrehistory   2.3377     0.8665   2.698  0.01079 *
## genretragedy   2.3452     0.8460   2.772  0.00897 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.151 on 34 degrees of freedom
## Multiple R-squared:  0.2328, Adjusted R-squared:  0.1877
## F-statistic: 5.159 on 2 and 34 DF,  p-value: 0.01105
model
##
## Call:
## lm(formula = clusters_time ~ genre, data = data_p3)
##
## Coefficients:
##  (Intercept)  genrehistory  genretragedy
##        3.571         2.338         2.345

I also devise a similar model for clusters by speech.

model <- lm(clusters_speech ~ genre, data = data_p3)
summary(model)
##
## Call:
## lm(formula = clusters_speech ~ genre, data = data_p3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.0833 -1.0833 -0.5714  1.4286  4.0909
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.5714     0.5928   6.025 7.99e-07 ***
## genrehistory   2.3377     0.8936   2.616  0.01318 *
## genretragedy   2.5119     0.8725   2.879  0.00686 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.218 on 34 degrees of freedom
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.1905
## F-statistic: 5.235 on 2 and 34 DF,  p-value: 0.01042
model
##
## Call:
## lm(formula = clusters_speech ~ genre, data = data_p3)
##
## Coefficients:
##  (Intercept)  genrehistory  genretragedy
##        3.571         2.338         2.512

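The model shows that, as stated in the second thesis, genre is associated with the complexity of a play. By time, comedies average 3.57 clusters, while history plays and tragedies average about 2.34 and 2.35 clusters more, respectively; by speech, the differences are 2.34 and 2.51. Tragedies therefore rank as the most complex, followed by history plays and then comedies, although the gap between tragedies and history plays is very small, and the model compares each genre only with comedies, not with each other. The p-values are also higher than for the number of characters: the history coefficient is significant at the 5% level (p = 0.011 by time, 0.013 by speech) and the tragedy coefficient at the 1% level (p = 0.009 and 0.007).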
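Because genre enters the model as dummy variables with comedy as the reference level, the coefficients can be checked against plain group means, and the reference level can be changed to compare tragedies with history plays directly. The following is a sketch under the assumption that the values retyped from the summary table match `data_p3`; the names `plays`, `plays_h`, and the one-letter genre codes are introduced here for illustration.

```r
# One letter per play, in table order (c = comedy, h = history, t = tragedy).
codes <- strsplit(
  paste0("cctcttthhh", "hhhhttctcc", "thhhtchccc", "ccctttc"), ""
)[[1]]

plays <- data.frame(
  genre = factor(
    unname(c(c = "comedy", h = "history", t = "tragedy")[codes]),
    levels = c("comedy", "history", "tragedy")
  ),
  clusters_time = c(3, 3, 7, 7, 10, 9, 5, 4, 6, 6, 5, 10, 5,
                    5, 5, 7, 2, 6, 6, 2, 2, 8, 10, 3, 3, 3,
                    3, 3, 2, 4, 3, 5, 3, 8, 3, 6, 4)
)

# Comedy is the first factor level, so it is the reference category.
fit <- lm(clusters_time ~ genre, data = plays)

# The intercept equals the comedy mean; the other coefficients are
# differences between each genre's mean and the comedy mean.
group_means <- tapply(plays$clusters_time, plays$genre, mean)

# Refit with history as the reference level, so that the tragedy
# coefficient becomes the tragedy-minus-history difference.
plays_h <- transform(plays, genre = relevel(genre, ref = "history"))
fit_h <- lm(clusters_time ~ genre, data = plays_h)

print(group_means)
print(coef(fit))
print(summary(fit_h)$coefficients)
```

The coefficient table of the refitted model gives the estimate, standard error, and p-value for the tragedy versus history comparison, which the comedy-referenced output does not report.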

Complexity and genres - visualization

The graphs below serve as an aid to visualize the results of the modeling analysis above. I plotted the number of clusters by time and by speech in every Shakespearean play in the dataset.