Search code examples
rdplyrdata.tableiranges

Splitting overlapping rows, within groups, based on dates


I'm trying to create new rows based on overlapping time periods of existing rows. For example, I'd like to turn this:

Customer_Product <- data.table(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A03", "A03", "A03"), 
                Product=c("Prod1","Prod2","Prod3","Prod1","Prod2","Prod3","Prod1","Prod2","Prod3"), 
                Start_Date=c("1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015"),
                End_Date=c("2/1/2015","5/1/2015","5/1/2015","2/1/2015","5/1/2015","6/1/2015","2/1/2015","6/1/2015","5/1/2015"))
   Customer Product Start_Date End_Date
1:      A01   Prod1   1/1/2015 2/1/2015
2:      A01   Prod2   3/1/2015 5/1/2015
3:      A01   Prod3   4/1/2015 5/1/2015
4:      A02   Prod1   1/1/2015 2/1/2015
5:      A02   Prod2   3/1/2015 5/1/2015
6:      A02   Prod3   4/1/2015 6/1/2015
7:      A03   Prod1   1/1/2015 2/1/2015
8:      A03   Prod2   3/1/2015 6/1/2015
9:      A03   Prod3   4/1/2015 5/1/2015

Into something like this:

Customer_Product_Combo <- data.table(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A02","A03", "A03","A03","A03"),
                Product_or_Combination=c("Prod1","Prod2","Prod2/Prod3","Prod1","Prod2","Prod2/Prod3","Prod3","Prod1","Prod2","Prod2/Prod3","Prod2"),
                Start_Date=c("1/1/2015","3/1/2015","4/1/2015","1/1/2015","3/1/2015","4/1/2015","5/1/2015","1/1/2015","3/1/2015","4/1/2015","5/1/2015"),
                End_Date=c("2/1/2015","4/1/2015","5/1/2015","2/1/2015","4/1/2015","5/1/2015","6/1/2015","2/1/2015","4/1/2015","5/1/2015","6/1/2015"))
    Customer Product_or_Combination Start_Date End_Date
 1:      A01                  Prod1   1/1/2015 2/1/2015
 2:      A01                  Prod2   3/1/2015 4/1/2015
 3:      A01            Prod2/Prod3   4/1/2015 5/1/2015
 4:      A02                  Prod1   1/1/2015 2/1/2015
 5:      A02                  Prod2   3/1/2015 4/1/2015
 6:      A02            Prod2/Prod3   4/1/2015 5/1/2015
 7:      A02                  Prod3   5/1/2015 6/1/2015
 8:      A03                  Prod1   1/1/2015 2/1/2015
 9:      A03                  Prod2   3/1/2015 4/1/2015
10:      A03            Prod2/Prod3   4/1/2015 5/1/2015
11:      A03                  Prod2   5/1/2015 6/1/2015

I've been looking into IRanges, because it seems like disjoin() may be a possible solution, but I can't see any way to inherit/merge the "Prod" data.

I've also been trying to sketch out something using lead/lag in dplyr followed by a gather/merge cycle, but it's also worth noting that I could have instances where more than 2 "Prod"s overlap, and then the logic just gets messy.

Is there a reasonable way to do this? Any help is greatly appreciated!


Solution

  • I'm using the data you posted (as a data.frame)

    Customer_Product <- data.frame(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A03", "A03", "A03"), 
                                   Product=c("Prod1","Prod2","Prod3","Prod1","Prod2","Prod3","Prod1","Prod2","Prod3"), 
                                   Start_Date=c("1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015"),
                                   End_Date=c("2/1/2015","5/1/2015","5/1/2015","2/1/2015","5/1/2015","6/1/2015","2/1/2015","6/1/2015","5/1/2015"))
    

    Here's a possible solution:

    library(tidyverse)
    library(data.table)
    library(lubridate)
    
    Customer_Product %>%
      mutate_at(vars(matches("Date")), dmy) %>%                          # update to date columns (if needed)
      mutate(day = map2(Start_Date, End_Date, ~seq(.x, .y, "day"))) %>%  # create sequence of days between start and end
      unnest() %>%                                                       # unnest data
      group_by(Customer, day) %>%                                        # for each customer and day
      summarise(Product = paste0(Product, collapse = "/")) %>%           # find corresponding products
      group_by(Customer, Product, id = rleid(Product)) %>%               # for each customer, product combination and position of product combination
      summarise(Start_Date = min(day),                                   # get start date
                End_Date = max(day)) %>%                                 # get end date
      ungroup() %>%                                                      # ungroup
      select(-id) %>%                                                    # remove id column
      arrange(Customer, Start_Date)                                      # order rows (if needed)
    
    
    # # A tibble: 11 x 4
    #   Customer Product     Start_Date End_Date  
    #   <fct>    <chr>       <date>     <date>    
    # 1 A01      Prod1       2015-01-01 2015-01-02
    # 2 A01      Prod2       2015-01-03 2015-01-03
    # 3 A01      Prod2/Prod3 2015-01-04 2015-01-05
    # 4 A02      Prod1       2015-01-01 2015-01-02
    # 5 A02      Prod2       2015-01-03 2015-01-03
    # 6 A02      Prod2/Prod3 2015-01-04 2015-01-05
    # 7 A02      Prod3       2015-01-06 2015-01-06
    # 8 A03      Prod1       2015-01-01 2015-01-02
    # 9 A03      Prod2       2015-01-03 2015-01-03
    #10 A03      Prod2/Prod3 2015-01-04 2015-01-05
    #11 A03      Prod2       2015-01-06 2015-01-06
    

    Note that this solution doesn't allow for date range overlap in your output table.

    For example, if you have Prod2/Prod3 during 4/1/2015 - 5/1/2015 you won't get Prod2 during 5/1/2015 - 6/1/2015, but 6/1/2015 - 6/1/2015, as 5/1/2015 is covered in Prod2/Prod3.