Search code examples
rdate-comparison

How can I check if a sequence of events is in order?


I have a data table where each column represents an event: if the event happened, there is a date value, if it did not happen, it's empty. Now, all events are optional, but if they happen, they have to follow an order (A, then B, C...).

Exploring the data, I have seen that there are at least a couple of data quality issues: e.g. event A empty, event B has a date: or event A has a later date than event B. I have to check 10 columns in 1000+ rows, so I was wondering if there was a way to automate this with R (I only need to mark if the sequence is OK or not, to then check the wrong cases manually)... The only thing I can think of is doing a lot of ifelse nested statements which does not seem appropriate at all.

Does someone know a better function/approach? Thanks in advance, here is some dummy data: (following events can have the same date)

> dput(Book1)
structure(list(ID = 1:20, A = structure(c(17532, NA, NA, 17226, 
17498, 17204, 17646, 17567, 17609, 17259, 17606, 17606, 17567, 
17612, 17612, 17612, 17395, 17687, 17612, 17687), class = "Date"), 
B = structure(c(17567, 17716, NA, 17259, 17562, NA, 17651, 
17606, 17612, 17226, NA, 17681, NA, NA, NA, NA, 17407, 17687, 
NA, 17716), class = "Date"), C = structure(c(NA, NA, NA, 
17260, NA, NA, NA, NA, 17614, NA, NA, 17687, NA, 17687, NA, 
NA, NA, NA, NA, 17716), class = "Date"), D = structure(c(NA, 
NA, NA, 17407, NA, NA, NA, NA, 17625, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), class = "Date"), E = structure(c(NA, 
NA, NA, 17606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), class = "Date")), .Names = c("ID", "A", 
"B", "C", "D", "E"), row.names = c(NA, -20L), spec = structure(list(
cols = structure(list(ID = structure(list(), class = c("collector_integer", 
"collector")), A = structure(list(), class = c("collector_character", 
"collector")), B = structure(list(), class = c("collector_character", 
"collector")), C = structure(list(), class = c("collector_character", 
"collector")), D = structure(list(), class = c("collector_character", 
"collector")), E = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("ID", "A", "B", "C", "D", "E")), 
default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = 
 c("tbl_df", 
"tbl", "data.frame"))

So, in this example, rows 2, 10, and 14 should be flagged.

Thanks in advance


Solution

  • I would do this in data.table but I'm sure the dplyr version is similar:

    library(data.table)
    setDT(DF) # <- convert to data.table
    DF[DF[ , melt(.SD, id.vars = 'ID')
           ][ , {
             non_na_idx = which(!is.na(value))
             any(diff(value) < 0, na.rm = TRUE) || 
               (length(non_na_idx) && 
                  max(non_na_idx) != length(non_na_idx))
           }, keyby = ID],
       flag := i.V1, on = 'ID'][]
    #     ID          A          B          C          D          E  flag
    #  1:  1 2018-01-01 2018-02-05       <NA>       <NA>       <NA> FALSE
    #  2:  2       <NA> 2018-07-04       <NA>       <NA>       <NA>  TRUE
    #  3:  3       <NA>       <NA>       <NA>       <NA>       <NA> FALSE
    #  4:  4 2017-03-01 2017-04-03 2017-04-04 2017-08-29 2018-03-16 FALSE
    #  5:  5 2017-11-28 2018-01-31       <NA>       <NA>       <NA> FALSE
    #  6:  6 2017-02-07       <NA>       <NA>       <NA>       <NA> FALSE
    #  7:  7 2018-04-25 2018-04-30       <NA>       <NA>       <NA> FALSE
    #  8:  8 2018-02-05 2018-03-16       <NA>       <NA>       <NA> FALSE
    #  9:  9 2018-03-19 2018-03-22 2018-03-24 2018-04-04       <NA> FALSE
    # 10: 10 2017-04-03 2017-03-01       <NA>       <NA>       <NA>  TRUE
    # 11: 11 2018-03-16       <NA>       <NA>       <NA>       <NA> FALSE
    # 12: 12 2018-03-16 2018-05-30 2018-06-05       <NA>       <NA> FALSE
    # 13: 13 2018-02-05       <NA>       <NA>       <NA>       <NA> FALSE
    # 14: 14 2018-03-22       <NA> 2018-06-05       <NA>       <NA>  TRUE
    # 15: 15 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
    # 16: 16 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
    # 17: 17 2017-08-17 2017-08-29       <NA>       <NA>       <NA> FALSE
    # 18: 18 2018-06-05 2018-06-05       <NA>       <NA>       <NA> FALSE
    # 19: 19 2018-03-22       <NA>       <NA>       <NA>       <NA> FALSE
    # 20: 20 2018-06-05 2018-07-04 2018-07-04       <NA>       <NA> FALSE
    

    apply-style answers will force coercion of your table to a matrix, which can come with some unexpected side effects (and be slow, for much larger examples), so I elected to reshape your data long -- addressing your issues on the long form of your data is much simpler, I reckon.

    The reshaping is done with melt:

    DF[ , melt(.SD, id.vars = 'ID')]
    #      ID variable      value
    #   1:  1        A 2018-01-01
    #   2:  2        A       <NA>
    #   3:  3        A       <NA>
    #   4:  4        A 2017-03-01
    #   5:  5        A 2017-11-28
    #   6:  6        A 2017-02-07
    #   7:  7        A 2018-04-25
    #   8:  8        A 2018-02-05
    #   9:  9        A 2018-03-19
    #  10: 10        A 2017-04-03
    # < more rows here >
    #  91: 11        E       <NA>
    #  92: 12        E       <NA>
    #  93: 13        E       <NA>
    #  94: 14        E       <NA>
    #  95: 15        E       <NA>
    #  96: 16        E       <NA>
    #  97: 17        E       <NA>
    #  98: 18        E       <NA>
    #  99: 19        E       <NA>
    # 100: 20        E       <NA>
    #      ID variable      value
    

    You have two conditions you're looking for --

    In any row, no date at a higher column (ordered by letter) should come before one in a lower column. In the data's long form, this means that the successive differences win each ID should be monotonically increasing, or equivalently, that diff(value) is always nonnegative. Hence, our flag is TRUE if any(diff(value) < 0, na.rm = TRUE), meaning at least one such difference was negative for this ID:

    DF[ , melt(.SD, id.vars = 'ID')
        ][ , any(diff(na.omit(value)) < 0, na.rm = TRUE), 
           keyby = ID]
    #     ID    V1
    #  1:  1 FALSE
    # < omitted; all FALSE >
    #  9:  9 FALSE
    # 10: 10  TRUE # <- column B comes before column A
    # 11: 11 FALSE
    # < omitted; all FALSE >
    # 20: 20 FALSE
    

    Once a column "goes missing", it should "stay missing", meaning there should be no NA gaps between observed values. This is equivalent to saying (a) there's at least one non-missing value in the row and (b) the number of non-missing elements is the same as the column number of the highest non-missing column:

    DF[ , melt(.SD, id.vars = 'ID')
        ][ , {
          non_na_idx = which(!is.na(value))
          length(non_na_idx) && max(non_na_idx) != length(non_na_idx)
        }, keyby = ID]
    #     ID    V1
    #  1:  1 FALSE
    #  2:  2  TRUE # <- Column A missing, B not
    #  3:  3 FALSE
    # < omitted; all FALSE >
    # 13: 13 FALSE
    # 14: 14  TRUE # <- Column B missing, C not
    # 15: 15 FALSE
    # < omitted; all FALSE >
    # 20: 20 FALSE
    

    Combine these two conditions to get the flag for all three rows.

    Lastly, we join the newly created flag back to the original table, and create a column called flag. This could be separated to two steps -- create the table with the flag column, then join:

    DF_with_flag = 
      DF[ , melt(.SD, id.vars = 'ID')
          ][ , {
            non_na_idx = which(!is.na(value))
            any(diff(na.omit(value)) < 0, na.rm = TRUE) || 
              (length(non_na_idx) && 
                 max(non_na_idx) != length(non_na_idx))
          }, keyby = ID]
    DF[DF_with_flag, flag := i.V1, on = 'ID']