Exploratory data analysis (EDA) is the process of trying to understand what a dataset can tell us beyond a narrowly focused probability statement. It relies heavily on graphical devices and summary statistics. EDA is a very important aspect of data analysis and will make up a large part of what we work on this semester.
It will take a few weeks before we have covered all of the core skills needed to do an exploratory analysis of a dataset. We’ll get there eventually, but I think it is useful to see at the start what exactly is involved in such an analysis. So, today, we will be looking at some online, interactive visualizations that we can explore without involving R or any data processing skills.
Types of data sets
When I talk about data, I tend to use the term more broadly than some people might. Data can include such things as text:
And here, even one with sound:
The projects that groups will look at today cover a similarly wide range of dataset types.
Exploratory data analysis
What’s the point of an exploratory, visual analysis? At a high level, we are trying to understand something about the world around us. The EDA phase of analysis lets us discover what the data at hand is specifically telling us and indicates directions for future studies.
Here is a project from the Digital Scholarship Lab (DSL) at the University of Richmond that does a great job of showing how visualizations can be powerful argumentative tools:
If you are not familiar with the DSL, I encourage you to check out their projects and visit the group’s space in the library. I plan to have someone from the DSL join us for a class later in the semester.
Modes of Analysis in EDA
Generally, when doing EDA we are trying to understand patterns contained within the data. I tend to see that interesting observations come in one of three forms:
- pattern description: an observation about a general pattern or trend found in the dataset
- anomaly detection: identification of data points that, in some way, seem to not follow the general pattern or otherwise behave in some extreme or anomalous way
- perspective: we have a special interest in a small set of data points (often just one) and are interested in how these points fall relative to the rest of the data; like anomaly detection, but we decide which points to care about ahead of the analysis
These do not function in a vacuum and good analyses will often pull from multiple types of observations.
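To make the three forms concrete, here is a minimal Python sketch on a made-up list of numbers. The data, the function names, and the two-standard-deviation cutoff are all assumptions chosen for illustration; they are one simple way to express each idea, not a prescribed method.

```python
from statistics import mean, stdev

# Hypothetical toy data, invented purely for illustration.
values = [102, 98, 105, 110, 97, 101, 180, 99, 104, 100]


def describe_pattern(xs):
    """Pattern description: summarize the typical level and spread."""
    return {"mean": mean(xs), "stdev": stdev(xs)}


def find_anomalies(xs, threshold=2.0):
    """Anomaly detection: indices of points lying more than `threshold`
    sample standard deviations away from the mean (an assumed rule of thumb)."""
    m, s = mean(xs), stdev(xs)
    return [i for i, x in enumerate(xs) if abs(x - m) > threshold * s]


def perspective(xs, index):
    """Perspective: for a point chosen ahead of time, the fraction of the
    other points that fall below it."""
    others = xs[:index] + xs[index + 1:]
    return sum(1 for x in others if x < xs[index]) / len(others)


summary = describe_pattern(values)   # general pattern
outliers = find_anomalies(values)    # points that break the pattern
rank = perspective(values, 3)        # where a pre-selected point sits
```

Notice that the anomaly check depends on the pattern summary (the mean and spread), which is a small example of how the three kinds of observation feed into one another.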
For today, here are some questions to answer as we look at interactive visualizations:
- What is the source of the data?
- What types of visualizations are being used?
- What are 2-3 interesting observations you found in the dataset?
I’ll answer these for the Forced Migration dataset, then let you work on these questions in small groups.
Now, it’s your turn. I have gathered several interactive visualizations that I find particularly interesting. You’ll be split into groups of about 3 people and assigned one of these datasets.
- A Day in the Life
- Visualizing MBTA Data
- The Wizards’ shooting stars
- The Voting Habits of Americans Like You
- The Rhythm of Food
- Executive Abroad
- Foreign Born Population (1850-2000)
- #SOTU2014: See the State of The Union address minute by minute on Twitter
- A Map of Baseball Nation
Spend about 15 minutes playing around with the visualization and understanding the data within it. Then, answer the questions above. Finally, designate one person to come up and present your group’s answers for your dataset.
For next class, please read the introduction to the text The Art of Data Science by Roger Peng and Elizabeth Matsui:
This should be a quick and easy read. It will help frame the discussion of the steps involved in making use of data in an analysis.