Search code examples
rdatetimedate-range

Find all date ranges for overlapping start and end dates in R


I have a data frame that looks like this:

w<-read.table(header=TRUE,text="
start.date   end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")

I'm trying to get an output that will combine overlapping start and end dates into one date range. So for the example set, I'd like to get:

w<-read.table(header=TRUE,text="
start.date   end.date
2006-06-26 2006-08-16
2007-06-09 2007-07-31
2007-08-04 2007-09-04
2007-09-05 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-08-20")

The question is similar to Date roll-up in R, but I don't need to do any sort of group by on mine, so the answer there is confusing.

Also, the code that was suggested in response to my question below does not work for certain parts of my data frame such as:

x<-read.table(header=TRUE,text="start.date   end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")

I am confused why it does not?


Solution

  • Try this:

    w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
    w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
    w$group <- 1:nrow(w) # common intervals should belong to same group
    merge.indices <- lapply(2:nrow(w), function(x) {
                        indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
                        if (length(indices) > 0) indices <- c(indices, x) 
                        indices})
    # assign the intervals the right groups
    for (i in 1:length(merge.indices)) {
      if (length(merge.indices[[i]]) > 0) {
        w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
      }
    }
    
    do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
    

    It conceptually merges overlapping intervals into the same group as shown below: enter image description here

    with output:

       start.date   end.date
    1  2006-01-19 2006-01-20
    2  2006-01-25 2006-01-29
    3  2006-02-24 2006-02-25
    4  2006-03-15 2006-03-22
    5  2006-04-29 2006-04-30
    6  2006-05-24 2006-05-25
    7  2006-06-26 2006-08-16
    11 2006-08-18 2006-08-19
    12 2006-08-28 2006-09-02