Data Normalization

Database normalization is a concept from database theory that describes a hierarchy of increasingly strict guidelines for how to structure data across multiple tables. Tables that are more normalized are less error-prone and contain fewer redundancies, which makes them more efficient to work with and improves data integrity. From a data science perspective, working with normalized tables often leads to cleaner code and analyses. However, it can be more difficult to collect the original data in a normalized format. Further, while it is usually more straightforward to write code based on normalized formats, doing so often involves more steps than working with a less normalized set of tables. So, as with everything, there are benefits and tradeoffs, which makes it all the more important to learn about the concept so that you can make good choices in your own work.

In these notes I will describe three increasingly strict types of data normalization: first normal form, second normal form, and third normal form. My recommendation when doing data science is to always respect the first normal form. The other two (stricter) forms of normalization are useful to consider, but do not feel the need to follow them dogmatically, particularly during the initial data collection process.

First Normal Form (1NF)

The first normal form, often abbreviated as 1NF, requires that each value (individual element) in a table represents a single piece of information. Formally, the technical definition of 1NF only disallows hierarchical structures such as “tables whose elements are themselves tables”. While it is possible to create what are called “nested tables” in R, we have not seen these in this course and will not have a need to create them (they come up frequently in DSST389, though). In data science practice, however, the 1NF rule typically corresponds to the stricter condition that no value should contain more than one individual piece of information. This is the definition that we will use in this course.

For an example of a dataset that does not respect 1NF, let’s look at the data from the fifth anti-pattern discussed in the recent slides:

## # A tibble: 5 × 3
##   student_name year     fav_foods              
##   <chr>        <chr>    <chr>                  
## 1 Sally        sophmore hotdogs; pizza; nachos 
## 2 Bob          sophmore bagels; sandwiches     
## 3 Jill         junior   ice cream              
## 4 Jack         junior   milkshakes; cookies    
## 5 Mary         senior   ribeye; lobster; caviar

The data table above is not in 1NF because some of the values in the last column contain more than one type of food. In order to create a 1NF version of the data, we would need to rewrite it like this:

## # A tibble: 11 × 3
##    student_name year     fav_foods 
##    <chr>        <chr>    <chr>     
##  1 Sally        sophmore hotdogs   
##  2 Sally        sophmore pizza     
##  3 Sally        sophmore nachos    
##  4 Bob          sophmore bagels    
##  5 Bob          sophmore sandwiches
##  6 Jill         junior   ice cream 
##  7 Jack         junior   milkshakes
##  8 Jack         junior   cookies   
##  9 Mary         senior   ribeye    
## 10 Mary         senior   lobster   
## 11 Mary         senior   caviar

The new form of the data now respects the 1NF rules. We will see in a future set of notes the code to perform this operation.
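In the meantime, here is a minimal sketch of how the conversion can be done with the tidyr package, assuming the original table is stored in an object named foods (a name I am choosing here for illustration):

library(tidyr)

# split each "a; b; c" value into its own row, repeating the other columns
foods_1nf <- separate_rows(foods, fav_foods, sep = "; ")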

Second Normal Form (2NF)

Describing the second normal form requires four concepts from database theory:

  1. A candidate key is a minimal set of features whose values uniquely identify each row of a table.
  2. A primary key is the candidate key chosen as the main identifier of a table; when there is only one candidate key, it is also the primary key.
  3. A prime feature is a feature that belongs to at least one candidate key.
  4. A non-prime feature is a feature that does not belong to any candidate key.

A table satisfies the second normal form, or 2NF, if it satisfies the conditions of 1NF and every non-prime feature depends on the entire value of a candidate key, not just part of it. An example will make this much clearer. We will primarily work with tables that have only a single option for the candidate key.

Consider the following version of the U.S. cities data we have been looking at this semester. It contains only a subset of the columns, and the rows have been randomly re-arranged to make the relationships between the features clearer.

## # A tibble: 6,900 × 5
##    city                  year population    lon   lat
##    <chr>                <dbl>      <dbl>  <dbl> <dbl>
##  1 Phoenix, AZ           1840         0  -112.   33.6
##  2 East Los Angeles, CA  1970       105    NA    NA  
##  3 Aurora, IL            2010       198.  -88.3  41.8
##  4 Berkeley, CA          1870         0  -122.   37.9
##  5 Port St. Lucie, FL    1820         0   -80.4  27.3
##  6 Birmingham, AL        1830         0   -86.8  33.5
##  7 Amarillo, TX          1990       158. -102.   35.2
##  8 Columbia, MD          1850        NA    NA    NA  
##  9 Sunrise Manor, NV     1960        NA    NA    NA  
## 10 Phoenix, AZ           2010      1446. -112.   33.6
## # … with 6,890 more rows

The sole candidate key, and therefore the primary key, in this data table consists of the first two features, \(\{\texttt{city},\, \texttt{year}\}\). Knowing this, consider how the other features in the data depend on the values in the candidate key. The value of the population feature depends on knowing both the city and the year; however, the longitude and latitude of the city in this dataset depend only on the city. A city does not change location over time, and so the lon and lat columns depend only on city and not on year. This is perhaps clearer if we look at a single city:

## # A tibble: 23 × 5
##    city          year population   lon   lat
##    <chr>        <dbl>      <dbl> <dbl> <dbl>
##  1 Richmond, VA  1790       3.76 -77.5  37.5
##  2 Richmond, VA  1800       5.74 -77.5  37.5
##  3 Richmond, VA  1810       9.74 -77.5  37.5
##  4 Richmond, VA  1820      12.1  -77.5  37.5
##  5 Richmond, VA  1830      16.1  -77.5  37.5
##  6 Richmond, VA  1840      20.2  -77.5  37.5
##  7 Richmond, VA  1850      27.6  -77.5  37.5
##  8 Richmond, VA  1860      37.9  -77.5  37.5
##  9 Richmond, VA  1870      51.0  -77.5  37.5
## 10 Richmond, VA  1880      63.6  -77.5  37.5
## # … with 13 more rows

As you can see above, the data contains a lot of duplicated information: it repeats the location of Richmond, Virginia for every year even though the location does not change.
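One way to verify this kind of dependency in code is to count distinct combinations of features with dplyr. A minimal sketch, assuming the full table is stored in an object named us: if lon and lat depend only on city, adding them should not change the number of distinct rows.

library(dplyr)

# number of unique cities in the data
nrow(distinct(us, city))

# should match the count above if lon and lat are functions of city alone
# (assuming each city's coordinates are recorded consistently across years)
nrow(distinct(us, city, lon, lat))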

To make the data respect 2NF, we need to create two different tables. One contains the features that depend on the entire candidate key, which is just the result of removing the lon and lat features:

## # A tibble: 6,900 × 3
##    city                  year population
##    <chr>                <dbl>      <dbl>
##  1 Phoenix, AZ           1840         0 
##  2 East Los Angeles, CA  1970       105 
##  3 Aurora, IL            2010       198.
##  4 Berkeley, CA          1870         0 
##  5 Port St. Lucie, FL    1820         0 
##  6 Birmingham, AL        1830         0 
##  7 Amarillo, TX          1990       158.
##  8 Columbia, MD          1850        NA 
##  9 Sunrise Manor, NV     1960        NA 
## 10 Phoenix, AZ           2010      1446.
## # … with 6,890 more rows

And another table, with a single row for each city, provides its location:

## # A tibble: 300 × 3
##    city               lon   lat
##    <chr>            <dbl> <dbl>
##  1 Abilene, TX      -99.7  32.5
##  2 Akron, OH        -81.5  41.1
##  3 Albany, NY       -73.8  42.7
##  4 Albuquerque, NM -107.   35.1
##  5 Alexandria, VA   -77.1  38.8
##  6 Allentown, PA    -75.5  40.6
##  7 Amarillo, TX    -102.   35.2
##  8 Anaheim, CA     -118.   33.9
##  9 Anchorage, AK   -149.   61.2
## 10 Ann Arbor, MI    -83.7  42.3
## # … with 290 more rows

It is almost always the case that converting a dataset that does not respect 2NF into one that does involves splitting the data into multiple tables.
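In code, the split can be carried out with a pair of select() and distinct() calls. A minimal sketch, again assuming the full table is named us (the new table names are my own):

library(dplyr)

# table keyed by {city, year}: the population measurements
city_pop <- select(us, city, year, population)

# table keyed by {city}: one row per city with its fixed location
city_loc <- distinct(us, city, lon, lat)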

A useful concept is the idea of a functional dependency, which occurs when one feature in the dataset is determined by the value(s) of other feature(s). A fancy way of stating the second normal form is to say that no non-prime feature has a functional dependency on a proper subset of any candidate key.

Third Normal Form (3NF)

The next, and strictest, normal form that we will cover is called, somewhat unsurprisingly, the third normal form, or 3NF. A table is in 3NF if it is in 2NF and every feature that is not part of any candidate key depends only on the entire candidate key, and not on any other non-key feature. At first this seems like a redundant version of 2NF, but there is a subtle difference. Let’s see an example to understand the distinction.

We will take a portion of the food table combined with part of the diet table that we saw in one of the previous notebooks.

## # A tibble: 61 × 3
##    item        food_group vegan
##    <chr>       <chr>      <chr>
##  1 Apple       fruit      yes  
##  2 Asparagus   vegetable  yes  
##  3 Avocado     fruit      yes  
##  4 Banana      fruit      yes  
##  5 Chickpea    grains     yes  
##  6 String Bean vegetable  yes  
##  7 Beef        meat       no   
##  8 Bell Pepper vegetable  yes  
##  9 Crab        fish       no   
## 10 Broccoli    vegetable  yes  
## # … with 51 more rows

This table satisfies the 2NF conditions but not the 3NF conditions. To see why, notice that the only candidate key in this table is \(\{\texttt{item}\}\). Each of the other features, food_group and vegan, depends on the value of the entire candidate key (there is only one feature in the candidate key, so we get 2NF almost for free). However, there is still some redundancy in the data, because the column vegan is entirely determined by the column food_group. So, there is a functional dependency between one non-key feature and another feature that is not part of the primary key, and hence 3NF does not hold. To fix the problem, we just return to the original two data tables:

## # A tibble: 61 × 2
##    item        food_group
##    <chr>       <chr>     
##  1 Apple       fruit     
##  2 Asparagus   vegetable 
##  3 Avocado     fruit     
##  4 Banana      fruit     
##  5 Chickpea    grains    
##  6 String Bean vegetable 
##  7 Beef        meat      
##  8 Bell Pepper vegetable 
##  9 Crab        fish      
## 10 Broccoli    vegetable 
## # … with 51 more rows

And:

## # A tibble: 6 × 2
##   food_group vegan
##   <chr>      <chr>
## 1 fruit      yes  
## 2 vegetable  yes  
## 3 grains     yes  
## 4 meat       no   
## 5 fish       no   
## 6 dairy      no
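Here is a sketch of code that would produce these two tables from the combined version, assuming it is stored in an object named food:

library(dplyr)

# table keyed by {item}; drops the column that depends on food_group
food_items <- select(food, item, food_group)

# lookup table keyed by {food_group}, one row per group
food_groups <- distinct(food, food_group, vegan)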

There is one commonly seen type of even stricter normalization called the Boyce–Codd normal form, abbreviated as either BCNF or 3.5NF. It involves a very small change to the 3NF conditions that is difficult to describe and not particularly important in data science. It takes some clever thinking to even come up with an example of a dataset that satisfies 3NF but does not satisfy BCNF. There are even higher forms, including 4NF, 5NF, and 6NF, that are largely of theoretical concern and often not even discussed in a full-semester database course. So, we will end our lesson here.

As a final note, know that the 3NF condition was re-framed for data science by Hadley Wickham (the author of packages such as ggplot2 and dplyr) under the name “Tidy Data”. If you see this term or its variants in the data science literature or in software packages, know that it relates to the data normalization concepts presented here.

Movies Data

In class next week we will be working on two notebooks that investigate a dataset about the top 100 grossing films for each year from 1970 to 2019. The data comes from IMDb. I have (almost) organized the data into the 3NF format across four different tables. Let’s get familiar with the data here before attempting the somewhat difficult questions in this week’s notebooks.

# load the tidyverse (read_csv() comes from the readr package)
library(tidyverse)

movies <- read_csv(file.path("..", "data", "movies_50_years.csv"))
m_genre <- read_csv(file.path("..", "data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("..", "data", "movies_50_years_people.csv"))
m_color <- read_csv(file.path("..", "data", "movies_50_years_color.csv"))

The movies dataset contains one row for each movie. Most of the variables are fairly straightforward; you can see the units by opening the data dictionary. Three variables concern the movie’s theatrical poster. These give the average brightness (average pixel intensity), saturation (are the colors vivid or washed out / white?), and complexity (a cartoon image would have a low complexity; lots of text or icons would give a high complexity).

movies
## # A tibble: 5,000 × 12
##     year title    mpa   runtime gross ratin…¹ rating metac…² poste…³ poste…⁴
##    <dbl> <chr>    <chr>   <dbl> <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1  1970 Love St… PG        100 106.    28330    6.9      NA    79.0    8.03
##  2  1970 Airport  G         137 100.    16512    6.6      42    70.7   29.3 
##  3  1970 MASH     R         116  81.6   64989    7.5      NA    74.5   40.1 
##  4  1970 Patton   GP        172  61.7   90461    7.9      NA    83.1   17.4 
##  5  1970 The Ari… G          78  37.7   87551    7.1      NA    79.8   12.5 
##  6  1970 Little … PG-13     139  31.6   31412    7.6      NA    68.0    9.02
##  7  1970 Tora! T… G         144  29.6   30347    7.5      46    39.8   48.6 
##  8  1970 Catch-22 R         122  24.9   20997    7.2      NA    62.3   36.0 
##  9  1970 The Owl… PG         95  23.7    3107    6.5      NA    67.2   10.2 
## 10  1970 Joe      R         107  19.3    2633    6.8      NA    31.8   27.6 
## # … with 4,990 more rows, 2 more variables: poster_edgeness <dbl>,
## #   description <chr>, and abbreviated variable names ¹​rating_count,
## #   ²​metacritic, ³​poster_brightness, ⁴​poster_saturation

A second dataset gives more detailed information about each poster by indicating how much of the poster consists of a certain color. If you want to look at a movie poster itself, just search for the film on IMDb; the poster is the first image on the film’s page.

m_color
## # A tibble: 46,980 × 5
##     year title      color_type color  percentage
##    <dbl> <chr>      <chr>      <chr>       <dbl>
##  1  1970 Love Story hue        red      2.64    
##  2  1970 Love Story hue        orange   3.09    
##  3  1970 Love Story hue        yellow   0.0542  
##  4  1970 Love Story hue        green    0.236   
##  5  1970 Love Story hue        blue     0.319   
##  6  1970 Love Story hue        violet   0.000534
##  7  1970 Love Story greyscale  black   11.0     
##  8  1970 Love Story greyscale  grey     6.91    
##  9  1970 Love Story greyscale  white   75.8     
## 10  1970 Love Story hue        other    0       
## # … with 46,970 more rows
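Since the percentage column measures how much of a poster is covered by each color, the values for a single film should add up to roughly 100 if the color categories are exhaustive. A quick sanity check, using the objects loaded above:

# total color coverage for each poster; values should be close to 100
m_color |>
  group_by(year, title) |>
  summarize(total = sum(percentage), .groups = "drop")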

We also have a dataset of movie genres. The data structure is straightforward, but it needs to be kept in its own table because a single movie can be assigned to multiple genres.

m_genre
## # A tibble: 11,887 × 3
##     year title      genre    
##    <dbl> <chr>      <chr>    
##  1  1970 Love Story Drama    
##  2  1970 Love Story Romance  
##  3  1970 Airport    Action   
##  4  1970 Airport    Drama    
##  5  1970 Airport    Thriller 
##  6  1970 MASH       Comedy   
##  7  1970 MASH       Drama    
##  8  1970 MASH       War      
##  9  1970 Patton     Biography
## 10  1970 Patton     Drama    
## # … with 11,877 more rows
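Because the pair {year, title} also acts as a foreign key into the movies table, the genre rows can be combined with the film-level metadata using a join. A minimal sketch with the tables loaded above:

# attach film-level columns (such as gross and rating) to each genre row
left_join(m_genre, movies, by = c("year", "title"))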

Finally, we also have a dataset of people associated with each film. We do not have a lot of metadata about the people, but I have added a prediction of each person’s gender based on U.S. Social Security records. These are not always correct (there is a confidence score included as well) but are useful for some aggregate analyses. Note that the feature rank indicates the order in which the names were listed in the IMDb dataset, which sometimes indicates the relative importance of the actors to the film.

m_people
## # A tibble: 24,648 × 7
##     year title      role      rank person         gender gender_conf
##    <dbl> <chr>      <chr>    <dbl> <chr>          <chr>        <dbl>
##  1  1970 Love Story director     1 Arthur Hiller  male         0.994
##  2  1970 Love Story starring     1 Ali MacGraw    male         0.688
##  3  1970 Love Story starring     2 Ryan O'Neal    male         0.977
##  4  1970 Love Story starring     3 John Marley    male         0.996
##  5  1970 Love Story starring     4 Ray Milland    male         0.984
##  6  1970 Airport    director     1 George Seaton  male         0.993
##  7  1970 Airport    director     2 Henry Hathaway male         0.994
##  8  1970 Airport    starring     1 Burt Lancaster male         1    
##  9  1970 Airport    starring     2 Dean Martin    male         0.988
## 10  1970 Airport    starring     3 George Kennedy male         0.993
## # … with 24,638 more rows
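As a sketch of the kind of aggregate analysis mentioned above, the following counts the predicted genders of top-billed actors by decade; the particular grouping choices here are my own:

# predicted gender of the top-billed actor in each film, grouped by decade
m_people |>
  filter(role == "starring", rank == 1) |>
  mutate(decade = 10 * floor(year / 10)) |>
  count(decade, gender)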

There is a lot to unpack with these datasets. We will need to make use of the many methods we have learned so far this semester to make sense of the data.

Homework

For homework, complete the following three questions:

  1. Identify the candidate keys for each of the four movies tables above. Do any tables have the same exact candidate keys? Do any tables have multiple candidate keys?
  2. Describe which candidate keys appear as foreign key(s) in other tables.
  3. Identify the two violations of 3NF in my data. Is either of these also a violation of 2NF? How would you normalize the data to put it properly into 3NF?

Try not to worry too much about the meaning of each column (we will have a full data dictionary when we work on these in class); just do your best to answer the questions given your intuition.

Homework Answers

For Q1, the candidate keys for the tables are as follows:

  1. movies: {year, title}
  2. m_genre: {year, title, genre}
  3. m_color: {year, title, color}
  4. m_people: {year, title, role, rank}

Note that we need both the year and the title to describe a film because some films were re-made in a different year. None of the tables have alternative candidate keys when looking at the full dataset.

For Q2, the movies candidate key, {year, title}, appears in every other table as a foreign key. No other candidate keys appear in other tables.

For Q3, there are two violations. In the m_color table, the column color_type depends only on the column color, and {color} is a proper subset of the candidate key; this therefore breaks 2NF. The second violation occurs in m_people: the columns gender and gender_conf are functions of the column person. This is a violation of 3NF but not of 2NF because person is not part of the candidate key.
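Finally, here is a sketch of one way to carry out the fixes described in the Q3 answer, using select() and distinct() as before (the new table names are my own):

# fix for m_color: move the color_type lookup into its own table
color_types <- distinct(m_color, color, color_type)
m_color_fixed <- select(m_color, year, title, color, percentage)

# fix for m_people: move the person-level predictions into their own table
people <- distinct(m_people, person, gender, gender_conf)
m_people_fixed <- select(m_people, year, title, role, rank, person)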