I have a data table where each column represents an event: if the event happened, there is a date value, if it did not happen, it's empty. Now, all events are optional, but if they happen, they have to follow an order (A, then B, C...).
Exploring the data, I have seen that there are at least a couple of data quality issues: e.g. event A empty, event B has a date: or event A has a later date than event B. I have to check 10 columns in 1000+ rows, so I was wondering if there was a way to automate this with R (I only need to mark if the sequence is OK or not, to then check the wrong cases manually)... The only thing I can think of is doing a lot of ifelse nested statements which does not seem appropriate at all.
Does someone know a better function/approach? Thanks in advance, here is some dummy data: (following events can have the same date)
> dput(Book1)
structure(list(ID = 1:20, A = structure(c(17532, NA, NA, 17226,
17498, 17204, 17646, 17567, 17609, 17259, 17606, 17606, 17567,
17612, 17612, 17612, 17395, 17687, 17612, 17687), class = "Date"),
B = structure(c(17567, 17716, NA, 17259, 17562, NA, 17651,
17606, 17612, 17226, NA, 17681, NA, NA, NA, NA, 17407, 17687,
NA, 17716), class = "Date"), C = structure(c(NA, NA, NA,
17260, NA, NA, NA, NA, 17614, NA, NA, 17687, NA, 17687, NA,
NA, NA, NA, NA, 17716), class = "Date"), D = structure(c(NA,
NA, NA, 17407, NA, NA, NA, NA, 17625, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), class = "Date"), E = structure(c(NA,
NA, NA, 17606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), class = "Date")), .Names = c("ID", "A",
"B", "C", "D", "E"), row.names = c(NA, -20L), spec = structure(list(
cols = structure(list(ID = structure(list(), class = c("collector_integer",
"collector")), A = structure(list(), class = c("collector_character",
"collector")), B = structure(list(), class = c("collector_character",
"collector")), C = structure(list(), class = c("collector_character",
"collector")), D = structure(list(), class = c("collector_character",
"collector")), E = structure(list(), class = c("collector_character",
"collector"))), .Names = c("ID", "A", "B", "C", "D", "E")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class =
c("tbl_df",
"tbl", "data.frame"))
So, in this example, rows 2, 10, and 14 should be flagged.
Thanks in advance
I would do this in data.table
but I'm sure the dplyr
version is similar:
library(data.table)
setDT(DF) # <- convert to data.table
DF[DF[ , melt(.SD, id.vars = 'ID')
][ , {
non_na_idx = which(!is.na(value))
any(diff(value) < 0, na.rm = TRUE) ||
(length(non_na_idx) &&
max(non_na_idx) != length(non_na_idx))
}, keyby = ID],
flag := i.V1, on = 'ID'][]
# ID A B C D E flag
# 1: 1 2018-01-01 2018-02-05 <NA> <NA> <NA> FALSE
# 2: 2 <NA> 2018-07-04 <NA> <NA> <NA> TRUE
# 3: 3 <NA> <NA> <NA> <NA> <NA> FALSE
# 4: 4 2017-03-01 2017-04-03 2017-04-04 2017-08-29 2018-03-16 FALSE
# 5: 5 2017-11-28 2018-01-31 <NA> <NA> <NA> FALSE
# 6: 6 2017-02-07 <NA> <NA> <NA> <NA> FALSE
# 7: 7 2018-04-25 2018-04-30 <NA> <NA> <NA> FALSE
# 8: 8 2018-02-05 2018-03-16 <NA> <NA> <NA> FALSE
# 9: 9 2018-03-19 2018-03-22 2018-03-24 2018-04-04 <NA> FALSE
# 10: 10 2017-04-03 2017-03-01 <NA> <NA> <NA> TRUE
# 11: 11 2018-03-16 <NA> <NA> <NA> <NA> FALSE
# 12: 12 2018-03-16 2018-05-30 2018-06-05 <NA> <NA> FALSE
# 13: 13 2018-02-05 <NA> <NA> <NA> <NA> FALSE
# 14: 14 2018-03-22 <NA> 2018-06-05 <NA> <NA> TRUE
# 15: 15 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 16: 16 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 17: 17 2017-08-17 2017-08-29 <NA> <NA> <NA> FALSE
# 18: 18 2018-06-05 2018-06-05 <NA> <NA> <NA> FALSE
# 19: 19 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 20: 20 2018-06-05 2018-07-04 2018-07-04 <NA> <NA> FALSE
apply
-style answers will force coercion of your table to a matrix, which can come with some unexpected side effects (and be slow, for much larger examples), so I elected to reshape your data long -- addressing your issues on the long form of your data is much simpler, I reckon.
The reshaping is done with melt
:
DF[ , melt(.SD, id.vars = 'ID')]
# ID variable value
# 1: 1 A 2018-01-01
# 2: 2 A <NA>
# 3: 3 A <NA>
# 4: 4 A 2017-03-01
# 5: 5 A 2017-11-28
# 6: 6 A 2017-02-07
# 7: 7 A 2018-04-25
# 8: 8 A 2018-02-05
# 9: 9 A 2018-03-19
# 10: 10 A 2017-04-03
# < more rows here >
# 91: 11 E <NA>
# 92: 12 E <NA>
# 93: 13 E <NA>
# 94: 14 E <NA>
# 95: 15 E <NA>
# 96: 16 E <NA>
# 97: 17 E <NA>
# 98: 18 E <NA>
# 99: 19 E <NA>
# 100: 20 E <NA>
# ID variable value
You have two conditions you're looking for --
In any row, no date at a higher column (ordered by letter) should come before one in a lower column. In the data's long form, this means that the successive differences win each ID
should be monotonically increasing, or equivalently, that diff(value)
is always nonnegative. Hence, our flag
is TRUE
if any(diff(value) < 0, na.rm = TRUE)
, meaning at least one such difference was negative for this ID
:
DF[ , melt(.SD, id.vars = 'ID')
][ , any(diff(na.omit(value)) < 0, na.rm = TRUE),
keyby = ID]
# ID V1
# 1: 1 FALSE
# < omitted; all FALSE >
# 9: 9 FALSE
# 10: 10 TRUE # <- column B comes before column A
# 11: 11 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
Once a column "goes missing", it should "stay missing", meaning there should be no NA
gaps between observed values. This is equivalent to saying (a) there's at least one non-missing value in the row and (b) the number of non-missing elements is the same as the column number of the highest non-missing column:
DF[ , melt(.SD, id.vars = 'ID')
][ , {
non_na_idx = which(!is.na(value))
length(non_na_idx) && max(non_na_idx) != length(non_na_idx)
}, keyby = ID]
# ID V1
# 1: 1 FALSE
# 2: 2 TRUE # <- Column A missing, B not
# 3: 3 FALSE
# < omitted; all FALSE >
# 13: 13 FALSE
# 14: 14 TRUE # <- Column B missing, C not
# 15: 15 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
Combine these two conditions to get the flag for all three rows.
Lastly, we join the newly created flag back to the original table, and create a column called flag
. This could be separated to two steps -- create the table with the flag column, then join:
DF_with_flag =
DF[ , melt(.SD, id.vars = 'ID')
][ , {
non_na_idx = which(!is.na(value))
any(diff(na.omit(value)) < 0, na.rm = TRUE) ||
(length(non_na_idx) &&
max(non_na_idx) != length(non_na_idx))
}, keyby = ID]
DF[DF_with_flag, flag := i.V1, on = 'ID']