dates and overlapping dates function

I am trying to create a function that takes overlapping periods into account in this ordered data by id, start_ins, rxdate:

dt <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2003-12-29", "2004-01-12", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2004-02-27", "2004-01-13", "2004-03-20"))
)

In this set, observations are grouped by id and start_ins. People also purchase Rx on a date and the number of pills (amount). rx_end is

[, rx_end := rxdate + amount]

However, if you look closely, you will find that rx_end for line 2 comes after rxdate for line 3. In this case, there is an overlap between rxdate line 3 and end of previous prescription. I want to assume that a person finished a rx before starting a new rx. In this case rxdate for line 3 should start the day after rx_end of line 2 AND rx_end for line 3 should be updated to take into account the new rxdate in line 3. Same logic applies for lines 4 and 5. In this case, the new data set would look something like this:

dt1 <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-28", "2003-12-29", "2004-02-28", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-08-06", "2004-02-27", "2004-02-29", "2004-03-20"))
)

I tried writing this function using data.table:

overlapper <- function (dt){
dt[,{
prev_end_date <- shift(rx_end, 1, type = "lag")

for (i in 2:.N){
if(!is.na(prev_end_date[i-1]) & prev_end_date >= dt[i, rxdate]){
dt[i, rxdate := prev_end_date[i-1] +1]
dt[i, rx_end := rxdate + amount]
}
prev_end_date[i] <- dt[i, rx_end]
}
return(dt)}, by = .(id, start_ins)]}

Keep in mind that some people have only one observation, so these should not be taken into account and that is why I am doing a grouped operation by id and start_ins. I always want to look at line n and compare it to n-1 to update line n ONLY if there is an overlap. If there is no over lap, I would keep the line as is and move on to the next line.

Solution

You can do this using data.table::shift().

Calculate the duration of each prescription and put it in rx_days.
If there's an overlap with the previous prescription (by ID) move the start date to the day after the previous end date.
Make sure we never replace the first row in a group (we need to state this explicitly given your note in the comment about the change from < to <=).
Adjust the end dates based on the duration and the new start dates and remove the temporary variable rx_days.
Continue until this is stable.

Here are steps 1-4, which work with your sample data:

dt[, c("rxdate", "rx_end") := {
    i <- seq(.N) # never replace first row
    shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
    shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
    rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
    rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
    .(rxdate, rx_end)
}, .(id, start_ins)]

identical(dt, dt1)
# [1] TRUE

However, imagine this new data frame:

      id  start_ins     rxdate amount     rx_end
   <num>     <Date>     <Date>  <num>     <Date>
1:     1 2000-10-12 2010-01-04     30 2010-02-03
2:     2 2000-03-31 2010-03-30     28 2010-04-27
3:     2 2000-03-31 2010-04-08    100 2010-07-17
4:     2 2000-03-31 2010-07-20     10 2010-07-30 # new row
5:     2 2000-03-31 2010-08-15     17 2010-09-01 # new row
6:     3 2000-01-11 2003-12-29     60 2004-02-27
7:     3 2000-01-11 2004-01-12      1 2004-01-13
8:     3 2000-01-11 2004-03-10     10 2004-03-20

If we apply this approach, row 3 will update correctly but row 4 will not reflect the changes in row 3, and so does not trigger what should be an update in row 5.

To resolve this we can put it in a recursive function that calls itself until there are no more changes:

shift_dates <- function(dt, grp = c("id", "start_ins")) {
    old_state <- dt[, .(rxdate, rx_end)]
    dt[, c("rxdate", "rx_end") := {
        i <- seq(.N) # never replace first row
        shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
        shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
        rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
        rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
        .(rxdate, rx_end)
    }, grp]

    if (identical(dt[, .(rxdate, rx_end)], old_state)) {
        return(dt)
    }
    shift_dates(dt)
}

This means the changes from previous rows will propagate until complete:

shift_dates(dt2)
      id  start_ins     rxdate amount     rx_end
   <num>     <Date>     <Date>  <num>     <Date>
1:     1 2000-10-12 2010-01-04     30 2010-02-03
2:     2 2000-03-31 2010-03-30     28 2010-04-27
3:     2 2000-03-31 2010-04-28    100 2010-08-06
4:     2 2000-03-31 2010-08-07     10 2010-08-17 # change propagates
5:     2 2000-03-31 2010-08-18     17 2010-09-04 # change propagates
6:     3 2000-01-11 2003-12-29     60 2004-02-27
7:     3 2000-01-11 2004-02-28      1 2004-02-29
8:     3 2000-01-11 2004-03-10     10 2004-03-20

New data

dt2 <- data.table(
    id = c(1, 2, 2, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2010-07-20", "2010-08-15", "2003-12-29", "2004-01-12", "2004-03-10")),
    amount = c(30, 28, 100, 10, 17, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2010-07-30", "2010-09-01", "2004-02-27", "2004-01-13", "2004-03-20"))
)