16. Cleaning Antipatterns

In a recent set of slides, I shared some examples of ways that we should try to not organize our data. It is always better to avoid this antipatterns and collect our original data following the rules in Notebook 7 and respecting at (a minimum) the 1NF rules. However, because I thought it may be of interest, in these notes, I will show how it can be possible to fix these antipatterns from within R. You will see that some are relatively easy and others are quite involves, even for these small examples.

Antipattern 1

Here is the data from the first antipattern, which is missing a feature name for the first column of the data:

ap <- read_csv("../data/antipattern01.csv")
ap

## # A tibble: 5 × 5
##   ...1                country party            birth_year year_started
##   <chr>               <chr>   <chr>                 <dbl>        <dbl>
## 1 Sanna Marin         Finland Social Democrats       1985         2019
## 2 Élisabeth Borne     France  LREM                   1961         2022
## 3 Κυριάκος Μητσοτάκης Greece  New Democracy          1968         2019
## 4 Katrín Jakobsdóttir Iceland Left-Green             1976         2017
## 5 Kaja Kallas         Estonia Reform Party           1977         2021

Notice that R has given the feature the name ...1; we can rename this using the function rename:

ap %>%
  rename(name = ...1)

## # A tibble: 5 × 5
##   name                country party            birth_year year_started
##   <chr>               <chr>   <chr>                 <dbl>        <dbl>
## 1 Sanna Marin         Finland Social Democrats       1985         2019
## 2 Élisabeth Borne     France  LREM                   1961         2022
## 3 Κυριάκος Μητσοτάκης Greece  New Democracy          1968         2019
## 4 Katrín Jakobsdóttir Iceland Left-Green             1976         2017
## 5 Kaja Kallas         Estonia Reform Party           1977         2021

And that’s about it. Fixing one variable name is not a huge problem; the real challenge is if we had to do this for a large number of features.

Antipattern 2

In the second antipattern, we have a “*” in one of the numeric columns. When reading into R, the numeric column becomes a character column and we lose the ability to do any sort of numerical operations on it.

ap <- read_csv("../data/antipattern02.csv")
ap

## # A tibble: 5 × 5
##   name                country party            birth_year year_started
##   <chr>               <chr>   <chr>            <chr>             <dbl>
## 1 Sanna Marin         Finland Social Democrats 1985               2019
## 2 Élisabeth Borne     France  LREM             1961               2022
## 3 Κυριάκος Μητσοτάκης Greece  New Democracy    1968               2019
## 4 Katrín Jakobsdóttir Iceland Left-Green       1976*              2017
## 5 Kaja Kallas         Estonia Reform Party     1977               2021

We can fix this by using the stringi package and a series of mutates to reconver the column to a numeric type while building a new notes column:

ap %>%
  mutate(notes = stri_extract_first(birth_year, fixed = "*")) %>%
  mutate(birth_year = stri_replace_all(birth_year, "", fixed = "*")) %>%
  mutate(birth_year = as.integer(birth_year))

## # A tibble: 5 × 6
##   name                country party            birth_year year_started notes
##   <chr>               <chr>   <chr>                 <int>        <dbl> <chr>
## 1 Sanna Marin         Finland Social Democrats       1985         2019 <NA> 
## 2 Élisabeth Borne     France  LREM                   1961         2022 <NA> 
## 3 Κυριάκος Μητσοτάκης Greece  New Democracy          1968         2019 <NA> 
## 4 Katrín Jakobsdóttir Iceland Left-Green             1976         2017 *    
## 5 Kaja Kallas         Estonia Reform Party           1977         2021 <NA>

A similar technique can be used if you have units of measurement of currency signs written into the data. As with the first example, fixing one column is not too difficult. The problem comes when trying to fix many columns at once.

Antipattern 3

The third antipatterns concerns two tables that are put on sheet. This is actually a fairly difficult thing to do directly in R. Reading in the data all together is a mess:

ap <- read_csv("../data/antipattern03.csv")
ap

## # A tibble: 19 × 7
##    ...1  ...2    ...3  ...4  ...5  ...6    ...7 
##    <chr> <chr>   <chr> <chr> <chr> <chr>   <chr>
##  1 <NA>  Trial 1 <NA>  <NA>  <NA>  <NA>    <NA> 
##  2 <NA>  ALPHA   BETA  GAMMA DELTA EPSILON ZETA 
##  3 A     37      78    62    50    87      97   
##  4 B     88      39    11    95    5       36   
##  5 C     67      16    30    93    5       13   
##  6 D     91      10    90    59    93      67   
##  7 E     6       63    38    83    22      97   
##  8 F     10      51    48    32    14      51   
##  9 G     8       94    73    86    16      27   
## 10 <NA>  <NA>    <NA>  <NA>  <NA>  <NA>    <NA> 
## 11 <NA>  Trial 2 <NA>  <NA>  <NA>  <NA>    <NA> 
## 12 <NA>  ALPHA   BETA  GAMMA DELTA EPSILON ZETA 
## 13 A     92      52    11    71    100     55   
## 14 B     65      16    61    79    89      17   
## 15 C     14      46    67    93    78      90   
## 16 D     81      6     8     21    45      74   
## 17 E     43      12    27    54    15      81   
## 18 F     6       84    59    66    11      86   
## 19 G     54      9     67    24    65      17

The trick is to use the options skip and n_max to read each table into R separately. Here is the first table:

ap1 <- read_csv("../data/antipattern03.csv", skip = 2, n_max = 7)
ap1

## # A tibble: 7 × 7
##   ...1  ALPHA  BETA GAMMA DELTA EPSILON  ZETA
##   <chr> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 A        37    78    62    50      87    97
## 2 B        88    39    11    95       5    36
## 3 C        67    16    30    93       5    13
## 4 D        91    10    90    59      93    67
## 5 E         6    63    38    83      22    97
## 6 F        10    51    48    32      14    51
## 7 G         8    94    73    86      16    27

And here is the second:

ap2 <- read_csv("../data/antipattern03.csv", skip = 12, n_max = 7)
ap2

## # A tibble: 7 × 7
##   ...1  ALPHA  BETA GAMMA DELTA EPSILON  ZETA
##   <chr> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 A        92    52    11    71     100    55
## 2 B        65    16    61    79      89    17
## 3 C        14    46    67    93      78    90
## 4 D        81     6     8    21      45    74
## 5 E        43    12    27    54      15    81
## 6 F         6    84    59    66      11    86
## 7 G        54     9    67    24      65    17

We can fix the bad variable name, put the tables together, and add the trial information through a series of data verbs like this:

ap1 %>%
  mutate(trial = 1) %>%
  bind_rows(ap2) %>%
  mutate(trial = if_else(is.na(trial), 2, trial)) %>%
  rename(latin = ...1)

## # A tibble: 14 × 8
##    latin ALPHA  BETA GAMMA DELTA EPSILON  ZETA trial
##    <chr> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
##  1 A        37    78    62    50      87    97     1
##  2 B        88    39    11    95       5    36     1
##  3 C        67    16    30    93       5    13     1
##  4 D        91    10    90    59      93    67     1
##  5 E         6    63    38    83      22    97     1
##  6 F        10    51    48    32      14    51     1
##  7 G         8    94    73    86      16    27     1
##  8 A        92    52    11    71     100    55     2
##  9 B        65    16    61    79      89    17     2
## 10 C        14    46    67    93      78    90     2
## 11 D        81     6     8    21      45    74     2
## 12 E        43    12    27    54      15    81     2
## 13 F         6    84    59    66      11    86     2
## 14 G        54     9    67    24      65    17     2

If we wanted to create the even better long-format of the data described in the slides, we can add in a pivot and select function as well:

ap1 %>%
  mutate(trial = 1) %>%
  bind_rows(ap2) %>%
  mutate(trial = if_else(is.na(trial), 2, trial)) %>%
  rename(latin = ...1) %>%
  pivot_longer(-c(latin, trial), names_to = "greek", values_to = "number") %>%
  select(trial, latin, greek, number)

## # A tibble: 84 × 4
##    trial latin greek   number
##    <dbl> <chr> <chr>    <dbl>
##  1     1 A     ALPHA       37
##  2     1 A     BETA        78
##  3     1 A     GAMMA       62
##  4     1 A     DELTA       50
##  5     1 A     EPSILON     87
##  6     1 A     ZETA        97
##  7     1 B     ALPHA       88
##  8     1 B     BETA        39
##  9     1 B     GAMMA       11
## 10     1 B     DELTA       95
## # … with 74 more rows

And again, these things are doable with two tables that are otherwise well structured. If the sheet contains dozens of individual tables, it would be quite time-consuming to use this method to read in the data into R. Note that the choices of the n_max and skip functions required a bit of guess-and-check on my part to get this working.

Antipattern 4

In this antipattern there are hierarchical column names. This is one of the worst things you can run into in a dataset because there is really no simply answer about how to fix the data in R. Let’s illustrate this by showing how we could do this from the example. Reading the table in all at once is a mess:

ap <- read_csv("../data/antipattern04.csv")
ap

## # A tibble: 6 × 7
##   ...1         fall_semester ...3    ...4    spring_semester ...6    ...7   
##   <chr>        <chr>         <chr>   <chr>   <chr>           <chr>   <chr>  
## 1 student_name dsst289       math211 rhcs103 dsst389         math212 rhcs104
## 2 Sally        79            75      93      95              100     77     
## 3 Bob          84            77      80      97              82      88     
## 4 Jill         77            72      87      78              71      98     
## 5 Jack         91            98      91      78              99      98     
## 6 Mary         86            85      88      84              92      80

Let’s read in the data ignoring the first row as a starting point:

ap_tbl <- read_csv("../data/antipattern04.csv", skip = 1L)
ap_tbl

## # A tibble: 5 × 7
##   student_name dsst289 math211 rhcs103 dsst389 math212 rhcs104
##   <chr>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Sally             79      75      93      95     100      77
## 2 Bob               84      77      80      97      82      88
## 3 Jill              77      72      87      78      71      98
## 4 Jack              91      98      91      78      99      98
## 5 Mary              86      85      88      84      92      80

We can then pivot the data and get something that actually looks quite fine:

ap_tbl <- ap_tbl %>%
  pivot_longer(-c(student_name), names_to = "course", values_to = "grade")
ap_tbl

## # A tibble: 30 × 3
##    student_name course  grade
##    <chr>        <chr>   <dbl>
##  1 Sally        dsst289    79
##  2 Sally        math211    75
##  3 Sally        rhcs103    93
##  4 Sally        dsst389    95
##  5 Sally        math212   100
##  6 Sally        rhcs104    77
##  7 Bob          dsst289    84
##  8 Bob          math211    77
##  9 Bob          rhcs103    80
## 10 Bob          dsst389    97
## # … with 20 more rows

In this small example we could do some hand editing to figure out which rows are from the fall semester and which are from the spring semester. Let’s see if we can do this in a more programatic way that would sort of work for a larger table. Let’s read in the first row of column names as a seperate table:

ap_head <- read_csv("../data/antipattern04.csv", n_max = 1L)
ap_head

## # A tibble: 1 × 7
##   ...1         fall_semester ...3    ...4    spring_semester ...6    ...7   
##   <chr>        <chr>         <chr>   <chr>   <chr>           <chr>   <chr>  
## 1 student_name dsst289       math211 rhcs103 dsst389         math212 rhcs104

Some manipulation, now, can turn this into a metadata table telling us when each class was offered:

ap_head <- ap_head %>%
  select(-...1) %>%
  pivot_longer(-c(), names_to = "semester", values_to = "course") %>%
  mutate(semester = stri_replace(semester, "", fixed = "_semester")) %>%
  mutate(semester = if_else(stri_sub(semester, 1, 3) == "...", NA_character_, semester)) %>%
  fill(semester)
ap_head

## # A tibble: 6 × 2
##   semester course 
##   <chr>    <chr>  
## 1 fall     dsst289
## 2 fall     math211
## 3 fall     rhcs103
## 4 spring   dsst389
## 5 spring   math212
## 6 spring   rhcs104

This can then be join into the larger dataset to get the version of the data we wished had been stored in the first place.

ap_tbl %>%
  left_join(ap_head, by = "course")

## # A tibble: 30 × 4
##    student_name course  grade semester
##    <chr>        <chr>   <dbl> <chr>   
##  1 Sally        dsst289    79 fall    
##  2 Sally        math211    75 fall    
##  3 Sally        rhcs103    93 fall    
##  4 Sally        dsst389    95 spring  
##  5 Sally        math212   100 spring  
##  6 Sally        rhcs104    77 spring  
##  7 Bob          dsst289    84 fall    
##  8 Bob          math211    77 fall    
##  9 Bob          rhcs103    80 fall    
## 10 Bob          dsst389    97 spring  
## # … with 20 more rows

The technique here only works because there is a unique semester in which each class was held. Otherwise we’d need any even more involved technique to get the data into the right format. So, just please do not use hierarchical column names in your data collection!

Antipattern 5

The final antipattern is actually not that bad. In fact, while I never want my data to look like this when I am analyzing it, sometimes putting more than one thing in a value seperated by a common marker is actually an easy way to do data collection. Here is what the data look like in R:

ap <- read_csv("../data/antipattern05.csv")
ap

## # A tibble: 5 × 3
##   student_name year     fav_foods              
##   <chr>        <chr>    <chr>                  
## 1 Sally        sophmore hotdogs; pizza; nachos 
## 2 Bob          sophmore bagels; sandwiches     
## 3 Jill         junior   ice cream              
## 4 Jack         junior   milkshakes; cookies    
## 5 Mary         senior   ribeye; lobster; caviar

We can split apart the favorite foods using the stri_split function, but this results in a nested data table (the kind of thing we discussed when talking about the 1NF rules).

ap %>%
  mutate(fav_foods = stri_split(fav_foods, fixed = "; "))

## # A tibble: 5 × 3
##   student_name year     fav_foods
##   <chr>        <chr>    <list>   
## 1 Sally        sophmore <chr [3]>
## 2 Bob          sophmore <chr [2]>
## 3 Jill         junior   <chr [1]>
## 4 Jack         junior   <chr [2]>
## 5 Mary         senior   <chr [3]>

The unnest function comes to the rescue, though, and produces a table that respects the first normal form:

ap %>%
  mutate(fav_foods = stri_split(fav_foods, fixed = "; ")) %>%
  unnest(c(fav_foods))

## # A tibble: 11 × 3
##    student_name year     fav_foods 
##    <chr>        <chr>    <chr>     
##  1 Sally        sophmore hotdogs   
##  2 Sally        sophmore pizza     
##  3 Sally        sophmore nachos    
##  4 Bob          sophmore bagels    
##  5 Bob          sophmore sandwiches
##  6 Jill         junior   ice cream 
##  7 Jack         junior   milkshakes
##  8 Jack         junior   cookies   
##  9 Mary         senior   ribeye    
## 10 Mary         senior   lobster   
## 11 Mary         senior   caviar

We could recover the 3NF by splitting the data into two tables using the select function like this:

ap %>%
  select(student_name, year)

## # A tibble: 5 × 2
##   student_name year    
##   <chr>        <chr>   
## 1 Sally        sophmore
## 2 Bob          sophmore
## 3 Jill         junior  
## 4 Jack         junior  
## 5 Mary         senior

And this:

ap %>%
  select(-year) %>%
  mutate(fav_foods = stri_split(fav_foods, fixed = "; ")) %>%
  unnest(c(fav_foods))

## # A tibble: 11 × 2
##    student_name fav_foods 
##    <chr>        <chr>     
##  1 Sally        hotdogs   
##  2 Sally        pizza     
##  3 Sally        nachos    
##  4 Bob          bagels    
##  5 Bob          sandwiches
##  6 Jill         ice cream 
##  7 Jack         milkshakes
##  8 Jack         cookies   
##  9 Mary         ribeye    
## 10 Mary         lobster   
## 11 Mary         caviar

Homework Questions

Given last week’s exam, there are no homework questions for today.