Search code examples
rdata.table

dates and overlapping dates function


I am trying to create a function that takes overlapping periods into account in this ordered data by id, start_ins, rxdate:

dt <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2003-12-29", "2004-01-12", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2004-02-27", "2004-01-13", "2004-03-20"))
)

In this set, observations are grouped by id and start_ins. People also purchase Rx on a date and the number of pills (amount). rx_end is

[, rx_end := rxdate + amount]

However, if you look closely, you will find that rx_end for line 2 comes after rxdate for line 3. In this case, there is an overlap between rxdate line 3 and end of previous prescription. I want to assume that a person finished a rx before starting a new rx. In this case rxdate for line 3 should start the day after rx_end of line 2 AND rx_end for line 3 should be updated to take into account the new rxdate in line 3. Same logic applies for lines 4 and 5. In this case, the new data set would look something like this:

dt1 <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-28", "2003-12-29", "2004-02-28", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-08-06", "2004-02-27", "2004-02-29", "2004-03-20"))
)

I tried writing this function using data.table:

overlapper <- function (dt){
dt[,{
prev_end_date <- shift(rx_end, 1, type = "lag")

for (i in 2:.N){
if(!is.na(prev_end_date[i-1]) & prev_end_date >= dt[i, rxdate]){
dt[i, rxdate := prev_end_date[i-1] +1]
dt[i, rx_end := rxdate + amount]
}
prev_end_date[i] <- dt[i, rx_end]
}
return(dt)}, by = .(id, start_ins)]}

Keep in mind that some people have only one observation, so these should not be taken into account and that is why I am doing a grouped operation by id and start_ins. I always want to look at line n and compare it to n-1 to update line n ONLY if there is an overlap. If there is no over lap, I would keep the line as is and move on to the next line.


Solution

  • You can do this using data.table::shift().

    1. Calculate the duration of each prescription and put it in rx_days.
    2. If there's an overlap with the previous prescription (by ID) move the start date to the day after the previous end date.
    3. Make sure we never replace the first row in a group (we need to state this explicitly given your note in the comment about the change from < to <=).
    4. Adjust the end dates based on the duration and the new start dates and remove the temporary variable rx_days.
    5. Continue until this is stable.

    Here are steps 1-4, which work with your sample data:

    dt[, c("rxdate", "rx_end") := {
        i <- seq(.N) # never replace first row
        shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
        shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
        rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
        rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
        .(rxdate, rx_end)
    }, .(id, start_ins)]
    
    identical(dt, dt1)
    # [1] TRUE
    

    However, imagine this new data frame:

          id  start_ins     rxdate amount     rx_end
       <num>     <Date>     <Date>  <num>     <Date>
    1:     1 2000-10-12 2010-01-04     30 2010-02-03
    2:     2 2000-03-31 2010-03-30     28 2010-04-27
    3:     2 2000-03-31 2010-04-08    100 2010-07-17
    4:     2 2000-03-31 2010-07-20     10 2010-07-30 # new row
    5:     2 2000-03-31 2010-08-15     17 2010-09-01 # new row
    6:     3 2000-01-11 2003-12-29     60 2004-02-27
    7:     3 2000-01-11 2004-01-12      1 2004-01-13
    8:     3 2000-01-11 2004-03-10     10 2004-03-20
    

    If we apply this approach, row 3 will update correctly but row 4 will not reflect the changes in row 3, and so does not trigger what should be an update in row 5.

    To resolve this we can put it in a recursive function that calls itself until there are no more changes:

    shift_dates <- function(dt, grp = c("id", "start_ins")) {
        old_state <- dt[, .(rxdate, rx_end)]
        dt[, c("rxdate", "rx_end") := {
            i <- seq(.N) # never replace first row
            shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
            shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
            rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
            rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
            .(rxdate, rx_end)
        }, grp]
    
        if (identical(dt[, .(rxdate, rx_end)], old_state)) {
            return(dt)
        }
        shift_dates(dt)
    }
    

    This means the changes from previous rows will propagate until complete:

    shift_dates(dt2)
          id  start_ins     rxdate amount     rx_end
       <num>     <Date>     <Date>  <num>     <Date>
    1:     1 2000-10-12 2010-01-04     30 2010-02-03
    2:     2 2000-03-31 2010-03-30     28 2010-04-27
    3:     2 2000-03-31 2010-04-28    100 2010-08-06
    4:     2 2000-03-31 2010-08-07     10 2010-08-17 # change propagates
    5:     2 2000-03-31 2010-08-18     17 2010-09-04 # change propagates
    6:     3 2000-01-11 2003-12-29     60 2004-02-27
    7:     3 2000-01-11 2004-02-28      1 2004-02-29
    8:     3 2000-01-11 2004-03-10     10 2004-03-20
    

    New data

    dt2 <- data.table(
        id = c(1, 2, 2, 2, 2, 3, 3, 3),
        start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
        rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2010-07-20", "2010-08-15", "2003-12-29", "2004-01-12", "2004-03-10")),
        amount = c(30, 28, 100, 10, 17, 60, 1, 10),
        rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2010-07-30", "2010-09-01", "2004-02-27", "2004-01-13", "2004-03-20"))
    )