Today's lab walks you through an exploratory analysis in R using two datasets. It introduces many data visualization commands that we will cover in detail next time.
The dataset here shows the number of children born in London each year, separated by sex. Interactively in RStudio, you can view the data as a table by clicking on the dataset's name in the upper right-hand pane.
library(readr)  # provides read_csv()
births <- read_csv("https://statsmaths.github.io/stat_data/arbuthnot.csv")
What years are included in the study?
Answer: We have data from 1629 to 1710.
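One way to check the span of years without scrolling through the table is to ask R for the smallest and largest value in the year column. The sketch below uses a tiny stand-in data frame called toy, with made-up values, so that it runs on its own. With the real data you would presumably call range(births$year) instead.

```r
# Toy stand-in for the births table; the values are made up for illustration.
toy <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  total = c(100L, 120L, 110L)
)

# range() returns the minimum and maximum of a vector, in that order.
year_range <- range(toy$year)
year_range
```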
Where/how do you think this data may have been collected?
Answer: It was actually collected through church baptism records, by far the most complete record of births in England at the time.
Now, let’s visualize the dataset by constructing several different plots. We will learn later how the plotting mechanism actually works. Today just run the code and enjoy.
library(ggplot2)  # provides ggplot(), aes(), and the geom_* layers
ggplot(births, aes(year, total)) +
  geom_line()
Approximately how many children were born in London in 1701?
Answer: 15 thousand
What is one change you would like to make to the way R has constructed the plot?
Answer: Add better labels for the x and y axes.
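As a preview of one way to do this, ggplot2's labs() function can be added to a plot to set the axis labels. The sketch below uses a made-up stand-in data frame named toy, and the label text is only a suggestion. The same labs() layer could be added to the births plot.

```r
library(ggplot2)

# Toy stand-in for the births table; the values are made up for illustration.
toy <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  total = c(100L, 120L, 110L)
)

# Same line plot as before, with explicit axis labels added via labs().
p <- ggplot(toy, aes(year, total)) +
  geom_line() +
  labs(x = "Year", y = "Total births")
p
```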
Click on the data in the top right-hand pane. This will open an Excel-like view of the dataset. Describe what the variable head_of_state most likely means:
Answer: It means who was the head of state of the government in control of London. This changed, rather drastically, throughout the time period of the dataset.
You might also have noticed the row of three-letter abbreviations under the column names. These describe the type of each variable:
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
The types of graphics and data manipulation we can do with a given variable are largely determined by its data type.
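The same type information can be retrieved in code. The sketch below builds a small made-up data frame (the column names boys, ratio, and head_of_state, and all of the values, are chosen only for illustration) and asks R for the storage type of each column with typeof(). The results "integer", "double", and "character" correspond to the int, dbl, and chr abbreviations.

```r
# Toy data frame with one column of each type; all values are made up.
toy <- data.frame(
  boys          = c(10L, 12L),      # the L suffix makes these integers
  ratio         = c(1.5, 0.75),     # numbers with decimals are doubles
  head_of_state = c("A", "B"),      # quoted text is a character vector
  stringsAsFactors = FALSE
)

# typeof() reports how each column is stored.
col_types <- sapply(toy, typeof)
col_types
```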
Explain in words why the variables boys and girls are interpreted as integers.
Answer: These variables are counts and it does not make sense for us to have non-integer numbers here (say, half a person).
Create a bar plot of the heads of state:
ggplot(births, aes(head_of_state)) +
  geom_bar()
Who was the longest serving head of state during the time period of this data set?
Answer: Charles II
How are the heads of state arranged by default?
Answer: Alphabetically.
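The alphabetical arrangement can be reproduced outside of a plot. As a rough sketch, a character variable on a discrete axis is ordered much as if it had been converted to a factor, and a factor's default levels are the sorted unique values. The names below are a hand-picked illustration, not the full list from the dataset.

```r
# A few head-of-state names, deliberately listed out of alphabetical order.
heads <- c("Charles II", "Anne", "James II", "Charles II")

# factor() sorts the unique values to create its default levels.
default_order <- levels(factor(heads))
default_order
```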
We can add color to the line plot to combine the head of state information with the year and total number of births:
ggplot(births, aes(year, total)) +
  geom_line(aes(color = head_of_state))