## Objectives

Develop tools for summarizing a dataset numerically, and see how we can use these summaries to produce interesting new visualizations.

## Mid-course survey summary

Thank you all again for filling out the mid-course review forms. They were very helpful for me in planning the notes for the remainder of the semester. Some themes I saw:

- generally happy with the format of the quizzes and the notes I provide
- generally enjoy the short lecture followed by working on labs and projects
- you like using your own computer
- enjoy working on the projects in class
- a desire for more practice problems and labs
- almost no comments on the pace of the course (so I assume it's good!)
- a desire for some clarity on what is expected on the project drafts
- a request for more office hours right before the projects are due

It did not seem that there was a pattern of many other concerns (yay!). I've tried to add a little bit more practice to the labs, but mostly kept the format of the course consistent with what we have been doing lately. I'll try to clarify a bit more what is expected on the Project D drafts (there are no drafts for C). Also, I will hold and announce extra office hours before the next assignment.

Additionally, I did take the other one-off comments under consideration, even if they are not listed here. If you have any other comments or concerns, please still let me know!

## Motivation

Graphics are an excellent way of summarizing and
presenting information contained in a dataset. In many cases it
can be useful to combine these with purely numeric summaries.
These summaries are sometimes colloquially called *statistics*,
though I prefer to avoid this terminology.

In these notes, I will use the `msleep` dataset to show various
numerical summaries. Here is an example of the dataset:

Each row corresponds to a type of mammal, and gives basic numeric values that describe their sleeping cycles.

## Mean

The first statistical summary that most people learn about is the **mean**,
also commonly known as the average. It is calculated by adding all of a
variable's values together and dividing by the total number of values. If
we have a dataset of $n$ points with a variable $x$ (writing $x_1$ for the
first value, $x_2$ for the second, and so forth), the mean can be
formally defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The notation of using $\bar{x}$ (an x with a line above it) to represent the mean is
very common throughout the sciences and social sciences. It is often used in textbooks
and papers without even being defined. To calculate means in R, as we have
already seen, we can use the `mean` function. Here is an illustration that
`mean` behaves as expected, using the `sum` and `nrow` functions for
comparison.

Note that the second line is only there to describe what the mean is doing
and to verify that it works. Do **not** copy and use the second form in your
work.
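The comparison described above can be sketched on a small made-up data frame. The `toy` table and its values are hypothetical stand-ins (they are not rows of `msleep`); only `mean`, `sum`, and `nrow` come from the notes.

```r
# Hypothetical toy data: hours awake for five made-up animals (not msleep rows).
toy <- data.frame(awake = c(10, 12, 14, 16, 20))

# Preferred form: the built-in mean function.
mean_builtin <- mean(toy$awake)

# The definition written out: add the values, divide by the number of rows.
# Shown only to illustrate the formula; use mean() in your own work.
mean_manual <- sum(toy$awake) / nrow(toy)

print(mean_builtin)
print(mean_manual)
```

Both lines print the same number, which is the point of the check.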

## Quantiles

There are also a number of functions that allow us to compute quantiles,
a generalization of percentiles. For example, the `deciles` function splits
the dataset into 10 equally sized buckets:

This shows that about 1/2 of the mammals are awake less than 14.20 hours and about
1/2 are awake more than 14.20 hours. I use the word "about" here due to subtleties
regarding ties and repeated values; for all practical purposes this is generally not
important. Note that the 50th percentile has a special name that you have probably heard
before: the *median*.

Similarly, we see that roughly 1/10 of the mammals are awake less than 8.12 hours and roughly 1/10 are awake more than 20.8 hours. We also see that the sleepiest mammal is awake for only 4.1 hours and that one mammal is awake for 22.1 hours of the day.

We can similarly calculate what are called quartiles, splitting the data into
four equally sized groups using the `quartiles` function:

Notice that four buckets require five numbers, and that three of these (the minimum,
the median, and the maximum) line up with the deciles above. There are also functions
`ventiles` (20 buckets) and `percentiles` (100 buckets) that can be quite useful:

Ventiles are a bit esoteric, but I have found in my work that they can be very useful in practice. Percentiles are often useful when we want to look at the extreme values, such as the 97th, 98th and 99th percentiles.
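As a rough sketch of what these quantile functions compute, the same kinds of cut points can be obtained with base R's `quantile` function. This is a swapped-in technique, not the course helpers themselves: the definitions of `deciles` and `quartiles` are not shown in these notes, so their handling of ties and interpolation may differ slightly from `quantile`'s default. The `awake` vector below is made up for illustration.

```r
# Hypothetical toy data: hours awake for eleven made-up animals.
awake <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)

# Assumption: base R's quantile() stands in for the course helper functions.
# Ten buckets need 11 cut points; four buckets need 5 cut points.
deciles_base   <- quantile(awake, probs = seq(0, 1, by = 0.1))
quartiles_base <- quantile(awake, probs = seq(0, 1, by = 0.25))

print(deciles_base)
print(quartiles_base)

# The 50% cut point is the median.
print(median(awake))
```

In this toy example the 0%, 50%, and 100% cut points are the same in both results, which mirrors the overlap between quartiles and deciles noted above.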

## Deviation

Once we have defined the mean, we can then define the **deviation** of a
data value as the difference between the value and the mean of its variable:

$$x_i - \bar{x}$$

There is not a special R function for deviations because they are very
easy to calculate using the `mean` function. As an example, here is how
to create them:

Typically we will not need deviations directly, but they are used in the calculation of quantities measuring the variation about a mean.
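A minimal sketch of the calculation, using a made-up vector rather than the `msleep` data (the name `awake` and its values here are hypothetical):

```r
# Hypothetical toy data: hours awake for five made-up animals.
awake <- c(10, 12, 14, 16, 20)

# Deviation of each value: the value minus the mean of the variable.
deviation <- awake - mean(awake)
print(deviation)

# Negative and positive deviations cancel, so their sum is zero
# (up to floating-point rounding error).
print(sum(deviation))
```

Because the deviations always cancel out, their plain sum cannot measure spread, which is one motivation for squaring them.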

## Variance

We can use deviations to measure the spread of a variable by adding up the squared deviations. Why squares? For one thing, squaring makes negative deviations positive, though the same effect would come from applying the absolute value function. The specific reason for choosing the square is a bit too technical for our discussion.

The sum of squared deviations is calculated by the following formula:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

And can be computed in R as:

The sum of squares cannot be used directly to compare datasets
of different sizes, as it grows with the number of points. In order to compare
spread across datasets, we use a measurement called the **variance**,
which is essentially the average of the squared deviations:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The notation of using s^2 to represent the variance of a dataset is quite common.

Why do we use $n-1$ rather than $n$ to take the average? The technical reason is that if we want to measure the variance of a population using a sample from that population, we need to use $n-1$ in order to have an unbiased estimate of the population value. The short answer is not to worry about it, which I strongly suggest at this point.

The variance can be computed manually as follows:

Or, using the `var` function as follows:

Note: as with the `mean` function, you should use **only** the `var` function
in your work. I show the other form simply as a demonstration of the definition.
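The two forms can be compared on a small made-up vector. This is only a sketch of the definition; the `awake` values below are hypothetical and are not taken from `msleep`.

```r
# Hypothetical toy data: hours awake for five made-up animals.
awake <- c(10, 12, 14, 16, 20)
n <- length(awake)

# Sum of squared deviations from the mean.
sum_sq <- sum((awake - mean(awake))^2)

# Variance written out from the definition: divide by n - 1, not n.
# For illustration only; use var() in your own work.
var_manual <- sum_sq / (n - 1)

# Built-in function, which uses the same n - 1 denominator.
var_builtin <- var(awake)

print(sum_sq)
print(var_manual)
print(var_builtin)
```

The last two printed values agree, confirming that `var` divides by $n-1$.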

## Standard deviation

We often work with a quantity called the **standard deviation**, defined
simply as the square root of the variance.

Why bother taking the square root? For one thing, it is a matter of units. In
our example, the variance is given in "squared hours" (a nearly meaningless
quantity), but the standard deviation is given in "hours", just like the variable
itself. We can calculate the standard deviation for the awake variable using the
function `sd`:

Or, we can use the `group_summarize` function to compute the
standard deviation of the awake variable by the type of mammal:

Here we see that omnivores have more consistent sleeping hours than the other groups, and insectivores have the largest variation in sleeping hours per day.
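A minimal sketch of both calculations in base R, using a made-up data frame. The `toy` table, its `vore` labels, and its values are hypothetical, and base R's `tapply` is used as a stand-in for `group_summarize`, whose exact output format is not shown in these notes.

```r
# Hypothetical toy data: two made-up groups of three animals each.
toy <- data.frame(
  vore  = c("carni", "carni", "carni", "herbi", "herbi", "herbi"),
  awake = c(10, 12, 14, 4, 12, 20)
)

# Standard deviation of the whole variable, and a check against the definition.
sd_all   <- sd(toy$awake)
sd_check <- sqrt(var(toy$awake))
print(sd_all)

# Standard deviation within each group (assumption: tapply as a stand-in).
sd_by_vore <- tapply(toy$awake, toy$vore, sd)
print(sd_by_vore)
```

Both toy groups have the same mean, but the second group is far more spread out, which is exactly what the grouped standard deviations pick up.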

## Graphing Variation

Finally, we can use these measurements of distribution and variation in graphical forms. Typically, this comes up when we have a categorical grouping variable and another numeric variable of interest.

A boxplot shows, for each group on the x-axis, the distribution of the variable on the y-axis. The solid bar indicates the y-axis variable's median and the height of the box and the "whiskers" indicate measurements of variation (see the link boxplot for more information about the different ways these can be computed).
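As a rough illustration, a boxplot of this kind can be drawn with base R's `boxplot` function. This is a stand-in and not necessarily the plotting code used for these notes; the `toy` data frame and its values are made up.

```r
# Hypothetical toy data: two made-up groups of five animals each.
toy <- data.frame(
  vore  = c(rep("carni", 5), rep("herbi", 5)),
  awake = c(10, 11, 12, 13, 14, 4, 8, 12, 16, 20)
)

# Draw one box per group; boxplot() also returns the numbers it plotted.
bp <- boxplot(awake ~ vore, data = toy,
              xlab = "Type of mammal", ylab = "Hours awake")

# One column per group; rows are lower whisker, lower hinge (bottom of box),
# median (solid bar), upper hinge (top of box), and upper whisker.
print(bp$names)
print(bp$stats)
```

The returned `stats` matrix makes the link to the earlier sections explicit: the box is built from quartile-like cut points around the median.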

Similarly, a violin plot is a newer twist on the boxplot that attempts to show more details about the distribution by varying its width with the distribution of the points:

### Practice

We will work on the next lab for the remainder of the class: lab17.Rmd

Please upload your script to GitHub ahead of the next class.