Search code examples
rdplyrtransformationanalysis

Is there an efficient way (or a package) to dynamically filter a dataset by time between dates?


Say there is a dataset with duplicated PersonIDs, and multiple dates.

PersonID Date
1 2024-01-01
1 2024-01-02
1 2024-01-09
1 2024-01-15
2 2024-08-05
2 2024-08-06
3 2024-01-07
3 2024-01-08
3 2024-01-15

I want to keep only the first record for each PersonID, and filter out any records that is within a 3 day gap. Then, for the next record that is kept, remove any records that is within 3 days after that record as well. And reiterating that as many times as needed.

So in this example, only these will be left:

PersonID Date
1 2024-01-01
1 2024-01-09
1 2024-01-15
2 2024-08-05
3 2024-01-07
3 2024-01-15

Is there an efficient way of doing that? Or do I have to create a column one step at a time?


Solution

  • This type of windowing operation could be done with the help of Reduce. For example

    day_lag_fitler <- function(x, cutoff = 3) {
      Reduce(function(acc, current) {
        if (difftime(current, acc$ref, unit="days") < cutoff) {
          list(keep=FALSE, ref=acc$ref)
        } else {
          list(keep=TRUE, ref=current)
        }
      }, x[-1], init=list(keep=TRUE, ref=x[1]), accumulate = TRUE) |> sapply('[[', "keep")
    }
    

    We basically just keep a list of reference day and whether or not to keep the value and at the end we just peel off the keep part.

    And then using dplyr you can apply it to each of the person groups

    dd |>
      filter(day_lag_fitler(Date), .by=PersonID)
    #   PersonID       Date
    # 1        1 2024-01-01
    # 2        1 2024-01-09
    # 3        1 2024-01-15
    # 4        2 2024-01-16
    # 5        2 2024-09-01
    # 6        3 2024-01-07
    # 7        3 2024-01-15
    

    Tested with

    dd<- structure(list(PersonID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L
    ), Date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0), 
        min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 
        0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(1L, 2L, 9L, 15L, 
        16L, 1L, 7L, 8L, 15L), mon = c(0L, 0L, 0L, 0L, 0L, 8L, 0L, 
        0L, 0L), year = c(124L, 124L, 124L, 124L, 124L, 124L, 124L, 
        124L, 124L), wday = c(1L, 2L, 2L, 1L, 2L, 0L, 0L, 1L, 1L), 
        yday = c(0L, 1L, 8L, 14L, 15L, 244L, 6L, 7L, 14L), isdst = c(0L, 
        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), zone = c("EST", "EST", "EST", 
        "EST", "EST", "EDT", "EST", "EST", "EST"), gmtoff = c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt", 
    "POSIXt"), tzone = c("", "EST", "EDT"), balanced = TRUE)), row.names = c(NA, 
    -9L), class = "data.frame")