In a recent set of slides, I shared some examples of ways that we should try to not organize our data. It is always better to avoid this antipatterns and collect our original data following the rules in Notebook 7 and respecting at (a minimum) the 1NF rules. However, because I thought it may be of interest, in these notes, I will show how it can be possible to fix these antipatterns from within R. You will see that some are relatively easy and others are quite involves, even for these small examples.
Here is the data from the first antipattern, which is missing a feature name for the first column of the data:
<- read_csv("../data/antipattern01.csv")
ap ap
## # A tibble: 5 × 5
## ...1 country party birth_year year_started
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Sanna Marin Finland Social Democrats 1985 2019
## 2 Élisabeth Borne France LREM 1961 2022
## 3 Κυριάκος Μητσοτάκης Greece New Democracy 1968 2019
## 4 Katrín Jakobsdóttir Iceland Left-Green 1976 2017
## 5 Kaja Kallas Estonia Reform Party 1977 2021
Notice that R has given the feature the name ...1
; we
can rename this using the function rename
:
%>%
ap rename(name = ...1)
## # A tibble: 5 × 5
## name country party birth_year year_started
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Sanna Marin Finland Social Democrats 1985 2019
## 2 Élisabeth Borne France LREM 1961 2022
## 3 Κυριάκος Μητσοτάκης Greece New Democracy 1968 2019
## 4 Katrín Jakobsdóttir Iceland Left-Green 1976 2017
## 5 Kaja Kallas Estonia Reform Party 1977 2021
And that’s about it. Fixing one variable name is not a huge problem; the real challenge is if we had to do this for a large number of features.
In the second antipattern, we have a “*” in one of the numeric columns. When reading into R, the numeric column becomes a character column and we lose the ability to do any sort of numerical operations on it.
<- read_csv("../data/antipattern02.csv")
ap ap
## # A tibble: 5 × 5
## name country party birth_year year_started
## <chr> <chr> <chr> <chr> <dbl>
## 1 Sanna Marin Finland Social Democrats 1985 2019
## 2 Élisabeth Borne France LREM 1961 2022
## 3 Κυριάκος Μητσοτάκης Greece New Democracy 1968 2019
## 4 Katrín Jakobsdóttir Iceland Left-Green 1976* 2017
## 5 Kaja Kallas Estonia Reform Party 1977 2021
We can fix this by using the stringi package and a series of mutates to reconver the column to a numeric type while building a new notes column:
%>%
ap mutate(notes = stri_extract_first(birth_year, fixed = "*")) %>%
mutate(birth_year = stri_replace_all(birth_year, "", fixed = "*")) %>%
mutate(birth_year = as.integer(birth_year))
## # A tibble: 5 × 6
## name country party birth_year year_started notes
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Sanna Marin Finland Social Democrats 1985 2019 <NA>
## 2 Élisabeth Borne France LREM 1961 2022 <NA>
## 3 Κυριάκος Μητσοτάκης Greece New Democracy 1968 2019 <NA>
## 4 Katrín Jakobsdóttir Iceland Left-Green 1976 2017 *
## 5 Kaja Kallas Estonia Reform Party 1977 2021 <NA>
A similar technique can be used if you have units of measurement of currency signs written into the data. As with the first example, fixing one column is not too difficult. The problem comes when trying to fix many columns at once.
The third antipatterns concerns two tables that are put on sheet. This is actually a fairly difficult thing to do directly in R. Reading in the data all together is a mess:
<- read_csv("../data/antipattern03.csv")
ap ap
## # A tibble: 19 × 7
## ...1 ...2 ...3 ...4 ...5 ...6 ...7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> Trial 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> ALPHA BETA GAMMA DELTA EPSILON ZETA
## 3 A 37 78 62 50 87 97
## 4 B 88 39 11 95 5 36
## 5 C 67 16 30 93 5 13
## 6 D 91 10 90 59 93 67
## 7 E 6 63 38 83 22 97
## 8 F 10 51 48 32 14 51
## 9 G 8 94 73 86 16 27
## 10 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 11 <NA> Trial 2 <NA> <NA> <NA> <NA> <NA>
## 12 <NA> ALPHA BETA GAMMA DELTA EPSILON ZETA
## 13 A 92 52 11 71 100 55
## 14 B 65 16 61 79 89 17
## 15 C 14 46 67 93 78 90
## 16 D 81 6 8 21 45 74
## 17 E 43 12 27 54 15 81
## 18 F 6 84 59 66 11 86
## 19 G 54 9 67 24 65 17
The trick is to use the options skip
and
n_max
to read each table into R separately. Here is the
first table:
<- read_csv("../data/antipattern03.csv", skip = 2, n_max = 7)
ap1 ap1
## # A tibble: 7 × 7
## ...1 ALPHA BETA GAMMA DELTA EPSILON ZETA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 37 78 62 50 87 97
## 2 B 88 39 11 95 5 36
## 3 C 67 16 30 93 5 13
## 4 D 91 10 90 59 93 67
## 5 E 6 63 38 83 22 97
## 6 F 10 51 48 32 14 51
## 7 G 8 94 73 86 16 27
And here is the second:
<- read_csv("../data/antipattern03.csv", skip = 12, n_max = 7)
ap2 ap2
## # A tibble: 7 × 7
## ...1 ALPHA BETA GAMMA DELTA EPSILON ZETA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 92 52 11 71 100 55
## 2 B 65 16 61 79 89 17
## 3 C 14 46 67 93 78 90
## 4 D 81 6 8 21 45 74
## 5 E 43 12 27 54 15 81
## 6 F 6 84 59 66 11 86
## 7 G 54 9 67 24 65 17
We can fix the bad variable name, put the tables together, and add the trial information through a series of data verbs like this:
%>%
ap1 mutate(trial = 1) %>%
bind_rows(ap2) %>%
mutate(trial = if_else(is.na(trial), 2, trial)) %>%
rename(latin = ...1)
## # A tibble: 14 × 8
## latin ALPHA BETA GAMMA DELTA EPSILON ZETA trial
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 37 78 62 50 87 97 1
## 2 B 88 39 11 95 5 36 1
## 3 C 67 16 30 93 5 13 1
## 4 D 91 10 90 59 93 67 1
## 5 E 6 63 38 83 22 97 1
## 6 F 10 51 48 32 14 51 1
## 7 G 8 94 73 86 16 27 1
## 8 A 92 52 11 71 100 55 2
## 9 B 65 16 61 79 89 17 2
## 10 C 14 46 67 93 78 90 2
## 11 D 81 6 8 21 45 74 2
## 12 E 43 12 27 54 15 81 2
## 13 F 6 84 59 66 11 86 2
## 14 G 54 9 67 24 65 17 2
If we wanted to create the even better long-format of the data described in the slides, we can add in a pivot and select function as well:
%>%
ap1 mutate(trial = 1) %>%
bind_rows(ap2) %>%
mutate(trial = if_else(is.na(trial), 2, trial)) %>%
rename(latin = ...1) %>%
pivot_longer(-c(latin, trial), names_to = "greek", values_to = "number") %>%
select(trial, latin, greek, number)
## # A tibble: 84 × 4
## trial latin greek number
## <dbl> <chr> <chr> <dbl>
## 1 1 A ALPHA 37
## 2 1 A BETA 78
## 3 1 A GAMMA 62
## 4 1 A DELTA 50
## 5 1 A EPSILON 87
## 6 1 A ZETA 97
## 7 1 B ALPHA 88
## 8 1 B BETA 39
## 9 1 B GAMMA 11
## 10 1 B DELTA 95
## # … with 74 more rows
And again, these things are doable with two tables that are otherwise
well structured. If the sheet contains dozens of individual tables, it
would be quite time-consuming to use this method to read in the data
into R. Note that the choices of the n_max
and
skip
functions required a bit of guess-and-check on my part
to get this working.
In this antipattern there are hierarchical column names. This is one of the worst things you can run into in a dataset because there is really no simply answer about how to fix the data in R. Let’s illustrate this by showing how we could do this from the example. Reading the table in all at once is a mess:
<- read_csv("../data/antipattern04.csv")
ap ap
## # A tibble: 6 × 7
## ...1 fall_semester ...3 ...4 spring_semester ...6 ...7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 student_name dsst289 math211 rhcs103 dsst389 math212 rhcs104
## 2 Sally 79 75 93 95 100 77
## 3 Bob 84 77 80 97 82 88
## 4 Jill 77 72 87 78 71 98
## 5 Jack 91 98 91 78 99 98
## 6 Mary 86 85 88 84 92 80
Let’s read in the data ignoring the first row as a starting point:
<- read_csv("../data/antipattern04.csv", skip = 1L)
ap_tbl ap_tbl
## # A tibble: 5 × 7
## student_name dsst289 math211 rhcs103 dsst389 math212 rhcs104
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sally 79 75 93 95 100 77
## 2 Bob 84 77 80 97 82 88
## 3 Jill 77 72 87 78 71 98
## 4 Jack 91 98 91 78 99 98
## 5 Mary 86 85 88 84 92 80
We can then pivot the data and get something that actually looks quite fine:
<- ap_tbl %>%
ap_tbl pivot_longer(-c(student_name), names_to = "course", values_to = "grade")
ap_tbl
## # A tibble: 30 × 3
## student_name course grade
## <chr> <chr> <dbl>
## 1 Sally dsst289 79
## 2 Sally math211 75
## 3 Sally rhcs103 93
## 4 Sally dsst389 95
## 5 Sally math212 100
## 6 Sally rhcs104 77
## 7 Bob dsst289 84
## 8 Bob math211 77
## 9 Bob rhcs103 80
## 10 Bob dsst389 97
## # … with 20 more rows
In this small example we could do some hand editing to figure out which rows are from the fall semester and which are from the spring semester. Let’s see if we can do this in a more programatic way that would sort of work for a larger table. Let’s read in the first row of column names as a seperate table:
<- read_csv("../data/antipattern04.csv", n_max = 1L)
ap_head ap_head
## # A tibble: 1 × 7
## ...1 fall_semester ...3 ...4 spring_semester ...6 ...7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 student_name dsst289 math211 rhcs103 dsst389 math212 rhcs104
Some manipulation, now, can turn this into a metadata table telling us when each class was offered:
<- ap_head %>%
ap_head select(-...1) %>%
pivot_longer(-c(), names_to = "semester", values_to = "course") %>%
mutate(semester = stri_replace(semester, "", fixed = "_semester")) %>%
mutate(semester = if_else(stri_sub(semester, 1, 3) == "...", NA_character_, semester)) %>%
fill(semester)
ap_head
## # A tibble: 6 × 2
## semester course
## <chr> <chr>
## 1 fall dsst289
## 2 fall math211
## 3 fall rhcs103
## 4 spring dsst389
## 5 spring math212
## 6 spring rhcs104
This can then be join into the larger dataset to get the version of the data we wished had been stored in the first place.
%>%
ap_tbl left_join(ap_head, by = "course")
## # A tibble: 30 × 4
## student_name course grade semester
## <chr> <chr> <dbl> <chr>
## 1 Sally dsst289 79 fall
## 2 Sally math211 75 fall
## 3 Sally rhcs103 93 fall
## 4 Sally dsst389 95 spring
## 5 Sally math212 100 spring
## 6 Sally rhcs104 77 spring
## 7 Bob dsst289 84 fall
## 8 Bob math211 77 fall
## 9 Bob rhcs103 80 fall
## 10 Bob dsst389 97 spring
## # … with 20 more rows
The technique here only works because there is a unique semester in which each class was held. Otherwise we’d need any even more involved technique to get the data into the right format. So, just please do not use hierarchical column names in your data collection!
The final antipattern is actually not that bad. In fact, while I never want my data to look like this when I am analyzing it, sometimes putting more than one thing in a value seperated by a common marker is actually an easy way to do data collection. Here is what the data look like in R:
<- read_csv("../data/antipattern05.csv")
ap ap
## # A tibble: 5 × 3
## student_name year fav_foods
## <chr> <chr> <chr>
## 1 Sally sophmore hotdogs; pizza; nachos
## 2 Bob sophmore bagels; sandwiches
## 3 Jill junior ice cream
## 4 Jack junior milkshakes; cookies
## 5 Mary senior ribeye; lobster; caviar
We can split apart the favorite foods using the
stri_split
function, but this results in a nested data
table (the kind of thing we discussed when talking about the 1NF
rules).
%>%
ap mutate(fav_foods = stri_split(fav_foods, fixed = "; "))
## # A tibble: 5 × 3
## student_name year fav_foods
## <chr> <chr> <list>
## 1 Sally sophmore <chr [3]>
## 2 Bob sophmore <chr [2]>
## 3 Jill junior <chr [1]>
## 4 Jack junior <chr [2]>
## 5 Mary senior <chr [3]>
The unnest
function comes to the rescue, though, and
produces a table that respects the first normal form:
%>%
ap mutate(fav_foods = stri_split(fav_foods, fixed = "; ")) %>%
unnest(c(fav_foods))
## # A tibble: 11 × 3
## student_name year fav_foods
## <chr> <chr> <chr>
## 1 Sally sophmore hotdogs
## 2 Sally sophmore pizza
## 3 Sally sophmore nachos
## 4 Bob sophmore bagels
## 5 Bob sophmore sandwiches
## 6 Jill junior ice cream
## 7 Jack junior milkshakes
## 8 Jack junior cookies
## 9 Mary senior ribeye
## 10 Mary senior lobster
## 11 Mary senior caviar
We could recover the 3NF by splitting the data into two tables using
the select
function like this:
%>%
ap select(student_name, year)
## # A tibble: 5 × 2
## student_name year
## <chr> <chr>
## 1 Sally sophmore
## 2 Bob sophmore
## 3 Jill junior
## 4 Jack junior
## 5 Mary senior
And this:
%>%
ap select(-year) %>%
mutate(fav_foods = stri_split(fav_foods, fixed = "; ")) %>%
unnest(c(fav_foods))
## # A tibble: 11 × 2
## student_name fav_foods
## <chr> <chr>
## 1 Sally hotdogs
## 2 Sally pizza
## 3 Sally nachos
## 4 Bob bagels
## 5 Bob sandwiches
## 6 Jill ice cream
## 7 Jack milkshakes
## 8 Jack cookies
## 9 Mary ribeye
## 10 Mary lobster
## 11 Mary caviar
Given last week’s exam, there are no homework questions for today.