Search code examples
rdatedata-cleaning

How to replace incorrect dates by using data from a different column in R


I have a large medical dataset that I am cleaning. Some of the dates for operations are certainly improbable (a century before the date of admission!). Below is an example df:

doa = c("2010-08-01", "2010-08-07", "2008-09-01") # date of admission
doo_1 = c("2010-08-02", "1900-01-01", "2008-09-03") # date of first op
doo_2 = c("1800-01-01", "1900-01-01", "2008-09-03") # date of second op
df= data.frame (doa, doo_1, doo_2)

I want to replace the values in columns 2 and 3 (in real dataset there are 25 columns) when it is less than doa with doa+1 (as operations tend to happen day following admission).


Solution

  • The OP has requested to replace improbable dates of operations by doa + 1 (next day after admission).

    If this simple rule is to be applied to all 25 columns it is not necessary to reshape the dataset from wide to long format and back to wide format again.

    library(data.table)
    
    # coerce date columns from factor or character to IDate
    date_cols <- c("doa", "doo_1", "doo_2")
    setDT(df)[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols][]
    # replace dates if improbable
    df[doo_1 < doa, doo_1 := doa + 1L][]
    df[doo_2 < doa, doo_2 := doa + 1L][]
    
              doa      doo_1      doo_2
    1: 2010-08-01 2010-08-02 2010-08-02
    2: 2010-08-07 2010-08-08 2010-08-08
    3: 2008-09-01 2008-09-03 2008-09-03
    

    The OP has mentioned that the production datasets contains 25 columns. This requires a more flexible solution where the column names to work on are not hardcoded but are supplied in a vector:

    op_cols <- c("doo_1", "doo_2")
    for (x in op_cols) {
      df[get(x) < doa, (x) := doa + 1L]
    }
    df[]
    

    Note that data.table updates by reference, i.e., without copying the whole data object but only the selected elements.