Tags: r, dplyr, tidyr, overlap

Separate overlaps in start and end range into own row of data frame


I have this data:

start_data <- data.frame(stringsAsFactors=FALSE,
                          Person = c(1, 1, 1, 1),
                          Event = c(1, 2, 3, 4),
                          Var1 = c(1, 2, 3, 5),
                          Var2 = c(7, 8, 9, 6),
                          Var3 = c(13, 14, 15, 7),
                          Start_Date = c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"),
                          End_Date = c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020")
)
 
start_data
  Person Event Var1 Var2 Var3 Start_Date   End_Date
1      1     1    1    7   13  1/01/2020 10/01/2020
2      1     2    2    8   14  5/01/2020 20/01/2020
3      1     3    3    9   15 21/01/2020 30/01/2020
4      1     4    5    6    7 23/01/2020 25/01/2020

And I want to convert it into this:

end_data <- data.frame(stringsAsFactors=FALSE,
                        Person = c(1, 1, 1, 1, 1, 1),
                        Event = c("1", "1 AND 2", "2", "3", "3 AND 4", "3"),
                        Var1 = c("1", "1 AND 2", "2", "3", "3 AND 5", "3"),
                        Var2 = c("7", "7 AND 8", "8", "9", "9 AND 6", "9"),
                        Var3 = c(13, 14, 14, 15, 15, 15),
                        Start_Date = c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
                                       "23/01/2020", "26/01/2020"),
                        End_Date = c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
                                     "25/01/2020", "30/01/2020")
)
 
end_data
  Person   Event    Var1    Var2 Var3 Start_Date   End_Date
1      1       1       1       7   13  1/01/2020  4/01/2020
2      1 1 AND 2 1 AND 2 7 AND 8   14  5/01/2020 10/01/2020
3      1       2       2       8   14 11/01/2020 20/01/2020
4      1       3       3       9   15 21/01/2020 22/01/2020
5      1 3 AND 4 3 AND 5 9 AND 6   15 23/01/2020 25/01/2020
6      1       3       3       9   15 26/01/2020 30/01/2020

The code should expand the data frame to isolate overlapping Start_Date and End_Date ranges. When two ranges overlap, a new row should be created containing the overlapping data, so that the final table contains no Start_Date/End_Date ranges that overlap with one another. Further, the result should be summarised for Event, Var1 and Var2 by concatenating the values, while Var3 should be aggregated by taking the max value in the overlapping ranges.

Ideally, I want to apply this code to a number of "Persons", so playing nice with group_by() or nesting with dplyr would be preferable.

Edit:

In the case of 3 overlapping periods, as per a question below, it would look like this: [image omitted]

Edit2:

The solution by @ekoam is very close. However, it does not handle the example below. Event 5 encapsulates the whole period, so there should be no missing ranges; yet a range with 'start' "2017-05-17" and 'end' "2017-06-11" is missing from the result.

> trial_start_data <- data.frame(stringsAsFactors=FALSE,
+                          Person = c(1, 1, 1, 1),
+                          Event = c(5,6,7,8),
+                          Start_Date = as.Date(c("24/04/2017","09/05/2017","12/06/2017","21/06/2017"), "%d/%m/%Y"),
+                          End_Date = as.Date(c("28/09/2017","16/05/2017","21/06/2017","25/06/2017"), "%d/%m/%Y")
+ )
> 
> trial_start_data
  Person Event Start_Date   End_Date
1      1     5 2017-04-24 2017-09-28
2      1     6 2017-05-09 2017-05-16
3      1     7 2017-06-12 2017-06-21
4      1     8 2017-06-21 2017-06-25

> disjoint_subsets(trial_start_data$Start_Date, trial_start_data$End_Date)
       start        end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-06-12 2017-06-20
4 2017-06-21 2017-06-21
5 2017-06-22 2017-06-25
6 2017-06-26 2017-09-28

Solution

  • The main problem here is to find all disjoint subsets of a set of ranges in an efficient way. Consider this function:

    disjoint_subsets <- function(starts, ends) {
      # work in integer day offsets from the earliest start
      t1 <- min(starts)
      starts <- as.integer(starts - t1)
      ends <- as.integer(ends - t1) + 2L
      # expand every range into the full sequence of days it covers,
      # scaled by 10 so the last digit is free for tagging boundaries
      nvec <- ends - starts + 1L
      x <- sequence(nvec, starts) * 10L
      # tag the first (+9) and last (-9) day of each range so the
      # boundaries survive sort(unique(...))
      ends <- cumsum(nvec); starts <- ends - nvec + 1L
      x[ends] <- x[ends] - 9L; x[starts] <- x[starts] + 9L
      x <- sort(unique(x))
      # b: positions of tagged boundary values; runs of consecutive
      # boundaries delimit the disjoint pieces
      b <- which(x %% 10L > 0L)
      lb <- x[head(b[!(b + 1L) %in% b], -1L)]
      ub <- x[tail(b[!(b - 1L) %in% b], -1L)]
      # strip the tags and the scaling, then shift back to dates
      lb <- (lb + 9L * (lb %% 10L < 2L) + 1L) %/% 10L
      ub <- (ub - 9L * (ub %% 10L > 8L) - 1L) %/% 10L
      data.frame(start = lb + t1 - 1L, end = ub + t1 - 1L)
    }
    
    Usage
    > with(trial_start_data, disjoint_subsets(Start_Date, End_Date))
           start        end
    1 2017-04-24 2017-05-08
    2 2017-05-09 2017-05-16
    3 2017-05-17 2017-06-11
    4 2017-06-12 2017-06-20
    5 2017-06-21 2017-06-21
    6 2017-06-22 2017-06-25
    7 2017-06-26 2017-09-28
    

    However, the function is a bit slow; there is still room for improvement. It finds all disjoint subsets in about a second for a data frame with 100k rows, while a data frame with 1 million rows takes 10-15 seconds. See the benchmark:

    starts_e6 <- sample(Sys.Date() + -1000:1000, size = 1e6, replace = TRUE)
    ends_e6 <- starts_e6 + sample.int(1000, 1e6, replace = TRUE)
    starts_e5 <- sample(Sys.Date() + -1000:1000, size = 1e5, replace = TRUE)
    ends_e5 <- starts_e5 + sample.int(1000, 1e5, replace = TRUE)
    microbenchmark::microbenchmark(
      disjoint_subsets(starts_e6, ends_e6),
      disjoint_subsets(starts_e5, ends_e5), 
      times = 2L
    )
    
    Unit: milliseconds
                                     expr      min       lq      mean    median        uq       max neval cld
     disjoint_subsets(starts_e6, ends_e6) 11299.59 11299.59 11366.623 11366.623 11433.652 11433.652     2   b
     disjoint_subsets(starts_e5, ends_e5)   873.66   873.66  1028.057  1028.057  1182.455  1182.455     2  a 
    

    base::unique is the bottleneck here. If we could somehow represent an interval with fewer elements, rather than expanding every day it covers, we could save a lot of time.
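    One way to sketch that "fewer elements" idea (a hypothetical alternative, not part of the answer's code) is a boundary sweep in base R that represents each range by its two endpoints only: the candidate cut points are every start and the day after every end, and a piece between consecutive cut points is kept if some input range covers it.

    ```r
    # Sketch of the "fewer elements per interval" idea: a boundary sweep
    # that never materialises the individual days of a range.
    disjoint_subsets_sweep <- function(starts, ends) {
      # candidate cut points: every start, and the day after every end
      cuts <- sort(unique(c(starts, ends + 1L)))
      lb <- head(cuts, -1L)        # lower bound of each candidate piece
      ub <- tail(cuts, -1L) - 1L   # upper bound of each candidate piece
      # no range starts or ends strictly inside a piece, so coverage is
      # constant within it; testing the lower bound is enough
      keep <- vapply(seq_along(lb),
                     function(i) any(starts <= lb[i] & ends >= lb[i]),
                     logical(1L))
      data.frame(start = lb[keep], end = ub[keep])
    }
    ```

    On trial_start_data this yields the same seven disjoint ranges shown above, including the 2017-05-17 to 2017-06-11 piece, though the coverage test as written is O(n) per piece.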

    The rest is just a piece of cake. You can use data.table::foverlaps to perform a non-equi (overlap) join of start_data and all the disjoint subsets. Then, summarise the joined data.table to get the end_data you want. For example,

    library(data.table)
    
    # parse the date columns (already Date in trial_start_data; this step
    # matters for the character columns in the original start_data)
    setDT(trial_start_data)[
      , c("Start_Date", "End_Date") := lapply(.SD, as.Date, "%d/%m/%Y"),
      .SDcols = c("Start_Date", "End_Date")
    ]
    dsubs <- trial_start_data[, disjoint_subsets(Start_Date, End_Date)]
    setDT(dsubs)
    
    setkey(dsubs, start, end)
    setkey(trial_start_data, Start_Date, End_Date)
    
    foverlaps(dsubs, trial_start_data, type = "within")
    

    Output

        Person Event Start_Date   End_Date      start        end
     1:      1     5 2017-04-24 2017-09-28 2017-04-24 2017-05-08
     2:      1     5 2017-04-24 2017-09-28 2017-05-09 2017-05-16
     3:      1     6 2017-05-09 2017-05-16 2017-05-09 2017-05-16
     4:      1     5 2017-04-24 2017-09-28 2017-05-17 2017-06-11
     5:      1     5 2017-04-24 2017-09-28 2017-06-12 2017-06-20
     6:      1     7 2017-06-12 2017-06-21 2017-06-12 2017-06-20
     7:      1     5 2017-04-24 2017-09-28 2017-06-21 2017-06-21
     8:      1     7 2017-06-12 2017-06-21 2017-06-21 2017-06-21
     9:      1     8 2017-06-21 2017-06-25 2017-06-21 2017-06-21
    10:      1     5 2017-04-24 2017-09-28 2017-06-22 2017-06-25
    11:      1     8 2017-06-21 2017-06-25 2017-06-22 2017-06-25
    12:      1     5 2017-04-24 2017-09-28 2017-06-26 2017-09-28
    

    A data frame with just 100k rows should not be a problem for any data.table function. This is also by far the most efficient way I can think of. I will leave out the remaining steps as the answer is already pretty long; besides, I think they were covered by my answer to one of your previous posts.
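    For completeness, the left-out summarise step might look like the sketch below. It is hypothetical, not the answer's code: dsubs is hard-coded from the disjoint ranges shown earlier, and only Event is concatenated because the trial data carries no Var columns (for the full start_data you would also paste Var1/Var2 and take max(Var3) in the same j expression).

    ```r
    library(data.table)

    # trial data and the seven disjoint ranges computed above (hard-coded)
    trial_start_data <- data.table(
      Person = 1, Event = c(5, 6, 7, 8),
      Start_Date = as.Date(c("2017-04-24", "2017-05-09",
                             "2017-06-12", "2017-06-21")),
      End_Date   = as.Date(c("2017-09-28", "2017-05-16",
                             "2017-06-21", "2017-06-25"))
    )
    dsubs <- data.table(
      start = as.Date(c("2017-04-24", "2017-05-09", "2017-05-17",
                        "2017-06-12", "2017-06-21", "2017-06-22",
                        "2017-06-26")),
      end   = as.Date(c("2017-05-08", "2017-05-16", "2017-06-11",
                        "2017-06-20", "2017-06-21", "2017-06-25",
                        "2017-09-28"))
    )
    setkey(dsubs, start, end)
    setkey(trial_start_data, Start_Date, End_Date)

    joined <- foverlaps(dsubs, trial_start_data, type = "within")
    # one row per disjoint range; concatenate the events active in it
    result <- joined[, .(Event = paste(Event, collapse = " AND ")),
                     by = .(Person, Start_Date = start, End_Date = end)]
    ```

    The 2017-06-21 row, for instance, collapses to Event "5 AND 7 AND 8".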