Search code examples
rdateoverlapping

Compare date intervals within the same data frame


I have search around and find similar questions but can make it work for my data.

I have a data frame with start and end dates, as well as several other factors. Ideally, the start date of a row should be posterior to the end date of any previous row, but the data has duplicated starts or ends, and sometimes the interval of the dates overlap.

I tried to make a reproducible example:

df = data.frame(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00","2018/04/16 15:30:00",
                   "2018/04/17 12:40:00","2018/04/17 18:50:00"),
                end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
                   "2018/04/17 16:40:00","2018/04/17 19:50:00"),
                value=c(10,15,11,13,14,12))

I was able to remove the duplicated (end or start dates), but I can't remove the overlapping intervals. I want to create a loop that "cleans" the intervals contained within any larger interval. So the results looks like this:

result = df[c(1,3,6),]

I thought I could make a loop that would "clean" both duplicates and overlapping intervals, but I can't make it work.

Any suggestions?


Solution

  • The data.table package is suited for this kind of problem using the overlapping join function foverlaps (inspired by findOverlaps function from the Bioconductor package IRanges) and then an anti-join (data.table syntax is B[!A, on]) to remove those inner intervals.

    library(data.table)
    cols <- c("start", "end")
    setDT(df)
    df[, (cols) := lapply(.SD, function(x) as.POSIXct(x, format="%Y/%m/%d %H:%M:%S")), .SDcols=cols]
    setkeyv(df, cols)
    anti <- foverlaps(df, df, type="within")[start!=i.start | end!=i.end | value!=i.value]
    df[!anti, on=.(start=i.start, end=i.end, value=i.value)]
    
    #                  start                 end value
    # 1: 2018-04-15 09:00:00 2018-04-16 08:00:00    10
    # 2: 2018-04-16 10:20:00 2018-04-17 18:20:00    11
    # 3: 2018-04-17 18:50:00 2018-04-17 19:50:00    12