I have a data frame that looks like this:
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
I'm trying to get an output that will combine overlapping start and end dates into one date range. So for the example set, I'd like to get:
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-08-16
2007-06-09 2007-07-31
2007-08-04 2007-09-04
2007-09-05 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-08-20")
The question is similar to Date roll-up in R, but I don't need to do any sort of group by on mine, so the answer there is confusing.
Also, the code that was suggested in response to my question below does not work for certain parts of my data frame such as:
x<-read.table(header=TRUE,text="start.date end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")
I am confused why it does not?
Try this:
w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
if (length(indices) > 0) indices <- c(indices, x)
indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
if (length(merge.indices[[i]]) > 0) {
w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
}
}
do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
It conceptually merges overlapping intervals into the same group as shown below:
with output:
start.date end.date
1 2006-01-19 2006-01-20
2 2006-01-25 2006-01-29
3 2006-02-24 2006-02-25
4 2006-03-15 2006-03-22
5 2006-04-29 2006-04-30
6 2006-05-24 2006-05-25
7 2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02