- List the key components of Exploratory Data Analysis (EDA)
- Describe and memorize the modes of EDA
- Infer possible data sources from interactive visualizations
- Make an oral presentation from a data analysis
Exploratory Data Analysis
Exploratory data analysis (EDA) is the process of trying to understand what a dataset can tell us beyond a narrowly focused probability statement. It relies heavily on graphical devices and summary statistics. EDA is a very important aspect of data analysis and will make up a large part of what we work on this semester.
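As a rough illustration of what "summary statistics and graphical devices" can mean in practice, here is a minimal Python sketch using only the standard library. The numbers in `values` are invented for illustration, and the helper names `summarize` and `bin_counts` are hypothetical, not part of any course material.

```python
from collections import Counter
from statistics import mean, quantiles

# Hypothetical measurements, invented purely for illustration.
values = [3, 5, 5, 6, 7, 7, 7, 8, 9, 12]


def summarize(data):
    """Return a few common summary statistics for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4)  # quartiles; q2 is the median
    return {
        "min": min(data),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(data),
        "mean": mean(data),
    }


def bin_counts(data, bin_width=2):
    """Count how many values fall in each bin, keeping empty bins."""
    counts = Counter((x // bin_width) * bin_width for x in data)
    first, last = min(counts), max(counts)
    return {
        start: counts[start]
        for start in range(first, last + bin_width, bin_width)
    }


print(summarize(values))

# A crude text histogram: one '#' per observation in each bin.
width = 2
for start, count in bin_counts(values, width).items():
    print(f"{start:>2}-{start + width - 1:<2} {'#' * count}")
```

The printed summary gives a numeric snapshot, while the text histogram shows the shape of the distribution, including the gap before the largest value. In real work a plotting library would replace the text histogram.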
Hans Rosling’s 200 Countries, 200 Years, 4 Minutes:
I have shown this in nearly all of my statistics courses, and while it is a bit dated, it is still the best illustration of what EDA is all about.
Types of data sets
Data can include such things as text:
And here, even one with sound:
Today, the projects that groups look at will cover a similarly wide range of dataset types.
Exploratory data analysis
What’s the point of an exploratory, visual analysis? At a high level, we are trying to understand something about the world around us. The EDA phase of analysis lets us discover what the data at hand is specifically telling us and indicates directions for future studies.
Here is a project from the Digital Scholarship Lab (DSL) at the University of Richmond that does a great job of showing how visualizations can be powerful argumentative tools:
If you are not familiar with the DSL, I encourage you to check out their projects and visit the group’s space in the library. I plan to have someone from the DSL join us for a class later in the semester.
Modes of Analysis in EDA
Generally, when doing EDA we are trying to understand patterns contained within the data. I tend to see that interesting observations come in one of three forms:
- pattern description: an observation about a general pattern or trend found in the dataset
- anomaly detection: identification of data points that, in some way, seem to not follow the general pattern or otherwise behave in some extreme or anomalous way
- perspective: we have a special interest in a small set of data points (often just one) and are interested in how this point falls relative to the rest of the data; like anomaly detection, but we decide which points to care about ahead of the analysis
These do not function in a vacuum, and good analyses will often draw on multiple types of observations.
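To make the three modes concrete, here is a minimal Python sketch of one possible way to compute each kind of observation. The `scores` data is invented, the function names are hypothetical, and the specific choices (median for the pattern, a z-score cutoff of 2 for anomalies, share of values below a chosen point for perspective) are assumptions made for illustration rather than prescribed methods.

```python
from statistics import mean, median, stdev

# Hypothetical dataset: one measurement per labeled unit (invented values).
scores = {
    "A": 50, "B": 52, "C": 49, "D": 51, "E": 48,
    "F": 53, "G": 50, "H": 90, "I": 47, "J": 50,
}


def describe_pattern(data):
    """Pattern description: report the typical (median) value."""
    return median(data.values())


def find_anomalies(data, threshold=2.0):
    """Anomaly detection: flag units more than `threshold` sample
    standard deviations away from the mean."""
    center = mean(data.values())
    spread = stdev(data.values())
    return [
        label for label, value in data.items()
        if abs(value - center) / spread > threshold
    ]


def share_below(data, label):
    """Perspective: for a unit chosen ahead of time, report the share
    of all units with a strictly smaller value."""
    target = data[label]
    below = sum(1 for value in data.values() if value < target)
    return below / len(data)


print(describe_pattern(scores))   # 50.0
print(find_anomalies(scores))     # ['H']
print(share_below(scores, "D"))   # 0.6
```

Note the difference between the last two calls: `find_anomalies` lets the data pick out the unusual unit, while `share_below` starts from a unit we already care about ("D" here) and asks where it sits relative to everything else.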
For today, here are some questions to answer as we look at interactive visualizations:
- What is the source of the data?
- What are some of the variables (measurements) being used in the visualization?
- What types of visualizations are being used? If you don’t know the exact name of something, just describe it.
- What are 2-3 interesting observations you found in the dataset?
I’ll answer these for the Forced Migration dataset, then let you work on these questions in small groups.
Now, it’s your turn. I have gathered several interactive visualizations that I find particularly interesting. You’ll be split into groups and assigned one of these datasets.
- A Day in the Life
- Visualizing MBTA Data
- The Wizards’ shooting stars
- The Voting Habits of Americans Like You
- The Rhythm of Food
- Executive Abroad
- Foreign Born Population (1850-2000)
- #SOTU2014: See the State of The Union address minute by minute on Twitter
- A Map of Baseball Nation
Spend about 15 minutes playing around with the visualization and understanding the data within it. Then, answer the four questions above. Finally, designate one person to come up and present your group’s answers for your dataset.