Database normalization is a concept from database theory that describes a hierarchy of increasingly strict guidelines for how to structure data across multiple tables. The benefits of more normalized tables are that they are less error-prone and more efficient, reduce redundancy, and improve data integrity. From a data science perspective, working with normalized tables often leads to cleaner code and analyses. However, it can be more difficult to collect the original data in a normalized format. Further, while it is usually more straightforward to write code based on normalized formats, it often involves more steps than working with a less normalized set of tables. So, as with everything, there are benefits and tradeoffs, which makes it all the more important to learn the concept so that you can make good choices in your own work.
In these notes I will describe three increasingly strict types of data normalization: first normal form, second normal form, and third normal form. My recommendation when doing data science is to always respect the first normal form. The other two (stricter) forms of normalization are useful to consider, but do not feel the need to follow them dogmatically, particularly during the initial data collection process.
The first normal form, often abbreviated as 1NF, simply requires that each value (individual element) in a table is a single, indivisible element. Formally, the technical definition of 1NF only disallows hierarchical structures such as tables whose elements are themselves tables. While it is possible to have what are called “nested tables” in R, we have not seen these in this course and will not have a need to create them (they come up frequently in DSST389, though). In data science practice, however, the 1NF rule typically corresponds to the stricter condition that no value should contain more than one individual piece of information. This is the definition that we will use in this course.
For an example of a dataset that does not respect 1NF, let’s look at the data from the fifth anti-pattern discussed in the recent slides:
## # A tibble: 5 × 3
## student_name year fav_foods
## <chr> <chr> <chr>
## 1 Sally sophmore hotdogs; pizza; nachos
## 2 Bob sophmore bagels; sandwiches
## 3 Jill junior ice cream
## 4 Jack junior milkshakes; cookies
## 5 Mary senior ribeye; lobster; caviar
The data table above is not in 1NF because some of the values in the last column contain more than one type of food. In order to create a 1NF version of the data, we would need to re-write it like this:
## # A tibble: 11 × 3
## student_name year fav_foods
## <chr> <chr> <chr>
## 1 Sally sophmore hotdogs
## 2 Sally sophmore pizza
## 3 Sally sophmore nachos
## 4 Bob sophmore bagels
## 5 Bob sophmore sandwiches
## 6 Jill junior ice cream
## 7 Jack junior milkshakes
## 8 Jack junior cookies
## 9 Mary senior ribeye
## 10 Mary senior lobster
## 11 Mary senior caviar
The new form of the data now respects the 1NF rules. We will see the code to perform this operation in a future set of notes.
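As a preview, here is a minimal sketch of one way to do this with the tidyr package; the data frame name foods and the construction of the example data are mine for illustration, not part of the original notes.

library(tibble)
library(tidyr)

# hypothetical data frame holding the non-1NF version of the table
foods <- tribble(
  ~student_name, ~year,      ~fav_foods,
  "Sally",       "sophmore", "hotdogs; pizza; nachos",
  "Bob",         "sophmore", "bagels; sandwiches",
  "Jill",        "junior",   "ice cream",
  "Jack",        "junior",   "milkshakes; cookies",
  "Mary",        "senior",   "ribeye; lobster; caviar"
)

# split each multi-valued cell into its own row, producing the 1NF version
foods_1nf <- separate_rows(foods, fav_foods, sep = "; ")

Running this produces the eleven-row table shown above, with one food per row.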
Describing the second normal form requires a few concepts from database theory. A candidate key is a minimal set of features whose values uniquely identify each row of a table. A primary key is the single candidate key chosen to identify the rows. Features that belong to at least one candidate key are called prime features; all remaining features are non-prime.
A table satisfies the second normal form, or 2NF, if it satisfies the conditions of 1NF and every feature outside of a candidate key depends on the entire value of a candidate key. An example will make this much clearer. We will primarily work with tables that have only a single option for the candidate key.
Consider the following version of the U.S. cities data we have been looking at this semester. It contains only a subset of the columns, and the rows have been randomly re-arranged to make the relationships between the features clearer.
## # A tibble: 6,900 × 5
## city year population lon lat
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Phoenix, AZ 1840 0 -112. 33.6
## 2 East Los Angeles, CA 1970 105 NA NA
## 3 Aurora, IL 2010 198. -88.3 41.8
## 4 Berkeley, CA 1870 0 -122. 37.9
## 5 Port St. Lucie, FL 1820 0 -80.4 27.3
## 6 Birmingham, AL 1830 0 -86.8 33.5
## 7 Amarillo, TX 1990 158. -102. 35.2
## 8 Columbia, MD 1850 NA NA NA
## 9 Sunrise Manor, NV 1960 NA NA NA
## 10 Phoenix, AZ 2010 1446. -112. 33.6
## # … with 6,890 more rows
The sole candidate key, and therefore primary key, in this data table consists of the first two features, \(\{\texttt{city},\, \texttt{year}\}\). Knowing this, consider how the other features in the data depend on the values in the candidate key. The value of the population feature depends on knowing both the city and the year; however, the longitude and latitude of the city in this dataset depend only on the city. The city does not change location over time, so the lon and lat columns depend only on city and not on year. This is perhaps clearer if we look at a single city:
## # A tibble: 23 × 5
## city year population lon lat
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Richmond, VA 1790 3.76 -77.5 37.5
## 2 Richmond, VA 1800 5.74 -77.5 37.5
## 3 Richmond, VA 1810 9.74 -77.5 37.5
## 4 Richmond, VA 1820 12.1 -77.5 37.5
## 5 Richmond, VA 1830 16.1 -77.5 37.5
## 6 Richmond, VA 1840 20.2 -77.5 37.5
## 7 Richmond, VA 1850 27.6 -77.5 37.5
## 8 Richmond, VA 1860 37.9 -77.5 37.5
## 9 Richmond, VA 1870 51.0 -77.5 37.5
## 10 Richmond, VA 1880 63.6 -77.5 37.5
## # … with 13 more rows
As you can see above, the data contains a lot of duplicated information because it repeats the location of Richmond, Virginia every year even though the location does not change.
To make the data respect 2NF, we need to create two different tables. One contains the features that depend on the entire original candidate key, which is just the result of removing the lon and lat features:
## # A tibble: 6,900 × 3
## city year population
## <chr> <dbl> <dbl>
## 1 Phoenix, AZ 1840 0
## 2 East Los Angeles, CA 1970 105
## 3 Aurora, IL 2010 198.
## 4 Berkeley, CA 1870 0
## 5 Port St. Lucie, FL 1820 0
## 6 Birmingham, AL 1830 0
## 7 Amarillo, TX 1990 158.
## 8 Columbia, MD 1850 NA
## 9 Sunrise Manor, NV 1960 NA
## 10 Phoenix, AZ 2010 1446.
## # … with 6,890 more rows
The other table contains a single row for each city, providing its location:
## # A tibble: 300 × 3
## city lon lat
## <chr> <dbl> <dbl>
## 1 Abilene, TX -99.7 32.5
## 2 Akron, OH -81.5 41.1
## 3 Albany, NY -73.8 42.7
## 4 Albuquerque, NM -107. 35.1
## 5 Alexandria, VA -77.1 38.8
## 6 Allentown, PA -75.5 40.6
## 7 Amarillo, TX -102. 35.2
## 8 Anaheim, CA -118. 33.9
## 9 Anchorage, AK -149. 61.2
## 10 Ann Arbor, MI -83.7 42.3
## # … with 290 more rows
It is almost always the case that converting a dataset that does not respect 2NF into one that does involves splitting the data into multiple tables.
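In code, the split amounts to a column selection plus a de-duplication. Here is a minimal sketch using dplyr, assuming the full table is stored in a data frame named us (the name is an assumption; the columns are those shown above):

library(dplyr)

# table keyed by {city, year}: keep only the feature that depends on both
us_population <- select(us, city, year, population)

# table keyed by city alone: one row per city with its fixed location
us_location <- distinct(us, city, lon, lat)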
A useful concept is the idea of a functional dependency, which occurs when one feature in the dataset is determined by the value(s) of other feature(s). A fancy way of stating the second normal form is to say that no non-prime feature has a functional dependency on a proper subset of any candidate key.
The last, and strictest, normal form that we will describe in detail is called, somewhat unsurprisingly, the third normal form, or 3NF. A table is in 3NF if it is in 2NF and every feature that is not part of any candidate key depends only on the entire candidate key, and not on any other non-prime feature. At first this seems like a redundant version of 2NF, but there is a subtle difference. Let’s see an example to understand the distinction.
We will take a portion of the food table combined with part of the diet table that we saw in one of the previous notebooks.
## # A tibble: 61 × 3
## item food_group vegan
## <chr> <chr> <chr>
## 1 Apple fruit yes
## 2 Asparagus vegetable yes
## 3 Avocado fruit yes
## 4 Banana fruit yes
## 5 Chickpea grains yes
## 6 String Bean vegetable yes
## 7 Beef meat no
## 8 Bell Pepper vegetable yes
## 9 Crab fish no
## 10 Broccoli vegetable yes
## # … with 51 more rows
This table satisfies the 2NF conditions but not the 3NF conditions.
To see why, notice that the only candidate key in this table is \(\{\texttt{item}\}\). Each of the other features, food_group and vegan, depends on the value of the entire candidate key (there is only one feature in the candidate key, so we get 2NF almost for free). However, there is still some redundancy in the data, because the column vegan is entirely dependent on the column food_group. So, there is a functional dependency between the last feature and a feature that is not part of the primary key, and hence 3NF does not hold. To fix the problem, we just return to the original two data tables:
## # A tibble: 61 × 2
## item food_group
## <chr> <chr>
## 1 Apple fruit
## 2 Asparagus vegetable
## 3 Avocado fruit
## 4 Banana fruit
## 5 Chickpea grains
## 6 String Bean vegetable
## 7 Beef meat
## 8 Bell Pepper vegetable
## 9 Crab fish
## 10 Broccoli vegetable
## # … with 51 more rows
And:
## # A tibble: 6 × 2
## food_group vegan
## <chr> <chr>
## 1 fruit yes
## 2 vegetable yes
## 3 grains yes
## 4 meat no
## 5 fish no
## 6 dairy no
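Nothing is lost by this split: the combined table can always be rebuilt with a join. Here is a minimal sketch with dplyr, assuming the two tables above are stored as food and diet (the names come from the earlier notebooks, so treat them as an assumption here):

library(dplyr)

# rebuild the combined (non-3NF) table: each item picks up its vegan
# flag through its food_group
food_with_vegan <- left_join(food, diet, by = "food_group")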
There is one commonly seen type of even stricter normalization called the Boyce–Codd normal form, abbreviated as either BCNF or 3.5NF. It involves a very small change to the 3NF conditions that is difficult to describe and not particularly important in data science. It takes some clever thinking to even come up with an example of a dataset that satisfies 3NF but does not satisfy BCNF. There are even higher forms, including 4NF, 5NF, and 6NF, that are largely of theoretical concern and often not even discussed in a full-semester database course. So, we will end our lesson here.
As a final note, know that the 3NF condition was re-framed for data science by Hadley Wickham (the author of packages such as ggplot2 and dplyr) using the concept of “Tidy Data”. If you see this term or its variants in the data science literature or in software packages, know that it relates to the data normalization concepts presented here.
In class next week we will be working on two notebooks that investigate a dataset about the top 100 grossing films for each year from 1970 to 2019. The data comes from IMDb. I have (almost) organized the data into the 3NF format across four different tables. Let’s get familiar with the data here before attempting the somewhat difficult questions in this week’s notebooks.
library(readr)  # read_csv() comes from the readr package

movies   <- read_csv(file.path("..", "data", "movies_50_years.csv"))
m_genre  <- read_csv(file.path("..", "data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("..", "data", "movies_50_years_people.csv"))
m_color  <- read_csv(file.path("..", "data", "movies_50_years_color.csv"))
The movies dataset contains one row for each movie. Most of the variables are fairly straightforward; you can see the units by opening the data dictionary. Three variables concern the movie’s theatrical poster. These give the average brightness (average pixel intensity), saturation (are the colors bright or washed out / white), and complexity (a cartoon image would have a low complexity; lots of text or icons would have a high complexity).
movies
## # A tibble: 5,000 × 12
## year title mpa runtime gross ratin…¹ rating metac…² poste…³ poste…⁴
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1970 Love St… PG 100 106. 28330 6.9 NA 79.0 8.03
## 2 1970 Airport G 137 100. 16512 6.6 42 70.7 29.3
## 3 1970 MASH R 116 81.6 64989 7.5 NA 74.5 40.1
## 4 1970 Patton GP 172 61.7 90461 7.9 NA 83.1 17.4
## 5 1970 The Ari… G 78 37.7 87551 7.1 NA 79.8 12.5
## 6 1970 Little … PG-13 139 31.6 31412 7.6 NA 68.0 9.02
## 7 1970 Tora! T… G 144 29.6 30347 7.5 46 39.8 48.6
## 8 1970 Catch-22 R 122 24.9 20997 7.2 NA 62.3 36.0
## 9 1970 The Owl… PG 95 23.7 3107 6.5 NA 67.2 10.2
## 10 1970 Joe R 107 19.3 2633 6.8 NA 31.8 27.6
## # … with 4,990 more rows, 2 more variables: poster_edgeness <dbl>,
## # description <chr>, and abbreviated variable names ¹rating_count,
## # ²metacritic, ³poster_brightness, ⁴poster_saturation
A second dataset gives more detailed information about each poster by indicating how much of a poster is of a certain color. If you want to look at a movie poster itself, just search for the film on IMDb; the poster is the first image on the film’s page.
m_color
## # A tibble: 46,980 × 5
## year title color_type color percentage
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1970 Love Story hue red 2.64
## 2 1970 Love Story hue orange 3.09
## 3 1970 Love Story hue yellow 0.0542
## 4 1970 Love Story hue green 0.236
## 5 1970 Love Story hue blue 0.319
## 6 1970 Love Story hue violet 0.000534
## 7 1970 Love Story greyscale black 11.0
## 8 1970 Love Story greyscale grey 6.91
## 9 1970 Love Story greyscale white 75.8
## 10 1970 Love Story hue other 0
## # … with 46,970 more rows
We also have a dataset of movie genres. The data structure is straightforward, but it needs to be kept in its own table because a single movie can be assigned to multiple genres.
m_genre
## # A tibble: 11,887 × 3
## year title genre
## <dbl> <chr> <chr>
## 1 1970 Love Story Drama
## 2 1970 Love Story Romance
## 3 1970 Airport Action
## 4 1970 Airport Drama
## 5 1970 Airport Thriller
## 6 1970 MASH Comedy
## 7 1970 MASH Drama
## 8 1970 MASH War
## 9 1970 Patton Biography
## 10 1970 Patton Drama
## # … with 11,877 more rows
Finally, we also have a dataset of people associated with each film. We do not have a lot of metadata about the people, but I have added a prediction of each person’s gender based on U.S. Social Security records. These are not always correct (there is a confidence score included as well) but are useful for some aggregate analyses. Note that the feature rank indicates the order in which the names were listed in the IMDb dataset, which sometimes indicates the relative importance of the actors to the film.
m_people
## # A tibble: 24,648 × 7
## year title role rank person gender gender_conf
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1970 Love Story director 1 Arthur Hiller male 0.994
## 2 1970 Love Story starring 1 Ali MacGraw male 0.688
## 3 1970 Love Story starring 2 Ryan O'Neal male 0.977
## 4 1970 Love Story starring 3 John Marley male 0.996
## 5 1970 Love Story starring 4 Ray Milland male 0.984
## 6 1970 Airport director 1 George Seaton male 0.993
## 7 1970 Airport director 2 Henry Hathaway male 0.994
## 8 1970 Airport starring 1 Burt Lancaster male 1
## 9 1970 Airport starring 2 Dean Martin male 0.988
## 10 1970 Airport starring 3 George Kennedy male 0.993
## # … with 24,638 more rows
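As one example of the aggregate analyses mentioned above, the following sketch computes the share of starring roles predicted to be female in each year. It assumes m_people has been loaded as above and that the other level of the gender column is coded "female" (an assumption, since only "male" appears in the rows printed here).

library(dplyr)

# proportion of starring credits predicted female, by year
m_people %>%
  filter(role == "starring") %>%
  group_by(year) %>%
  summarize(share_female = mean(gender == "female", na.rm = TRUE))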
There is a lot to unpack with these datasets. We will need to make use of the many methods we have learned so far this semester to make sense of the data.
For homework, complete the following three questions:
Try not to worry too much about the meaning of each column (we will have a full data dictionary when we work on these in class); just do your best to answer the questions given your intuition.
For Q1, the candidate keys for the tables are as follows:
Note that we need both the year and title to describe a film because some films were re-made in a different year. None of the tables have alternative candidate keys when looking at the full dataset.
For Q2, the movies candidate key appears on every other table as a foreign key. No other candidate keys appear on other tables.
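In practice, this foreign key is what lets us combine the tables. For example, here is a minimal sketch that attaches the movie-level metadata to each (movie, genre) pair, assuming the tables have been loaded as above:

library(dplyr)

# join on the {year, title} key shared by the two tables
genre_with_meta <- left_join(m_genre, movies, by = c("year", "title"))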
For Q3, there are two violations. In the m_color table, there is a dependency between color_type and color. The latter is a proper subset of the candidate key, so this breaks 2NF. The second violation occurs in m_people: the columns gender and gender_conf are functions of the column person. This is a violation of 3NF but not of 2NF because person is not part of the candidate key.
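If we wanted to repair these two violations, the fix follows the same pattern as before: move each offending dependency into its own table. A minimal sketch with dplyr; the new table names are just for illustration:

library(dplyr)

# fix the 2NF violation: color_type depends only on color
color_types   <- distinct(m_color, color, color_type)
m_color_fixed <- select(m_color, year, title, color, percentage)

# fix the 3NF violation: gender and gender_conf depend only on person
people         <- distinct(m_people, person, gender, gender_conf)
m_people_fixed <- select(m_people, year, title, role, rank, person)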