Search code examples
rlubridatewide-format-data

Make dataset wide format for entries falling within a certain date interval


I am trying to figure out how to make multiple customer orders placed within 120 days wide format for each customer. If a customer has orders placed over the course of several year, then there are multiple intervals of 120 days, so each customer may have multiple rows, with each row corresponding to an interval of 120 days from an index date. The index date(s) would be the first order date that falls outside the 120-day interval from the prior index order date. Thank you!

I have a dataframe:

customerid <- c("A1", "A1", "A1", "A1", "A1", "A2", "A2", "A2")
orderid <- c("1", "2", "3", "4", "5", "6", "7", "8")
orderdate <- c("2020-05-19", "2020-09-08", "2020-09-16", "2020-12-21", "2021-01-03", "2020-08-21","2020-11-22","2021-02-01")
df <- data.frame(customerid, orderid, orderdate)

The result should be:

enter image description here

Thank you!


Solution

  • I think easiest approach is:

    1. Build an index data frame which shows the first order for each customer

    2. For every order work out what '120 day interval' the order is located in

    3. Get a distinct order number within the 120 day interval

    4. Pivot that data wider with customerid and int120 as ID columns

      customerid <- c("A1", "A1", "A1", "A1", "A1", "A2", "A2", "A2")
      orderid <- c("1", "2", "3", "4", "5", "6", "7", "8")
      orderdate <- as.Date(c("2020-05-19", "2020-09-08", "2020-09-16", "2020-12-21", "2021-01-03", "2020-08-21","2020-11-22","2021-02-01"))
      df <- data.frame(customerid, orderid, orderdate)
      
      indexes <- df %>% 
        slice(which.min(orderdate),.by = customerid) %>% 
        select(customerid,indexdate = orderdate) # create index df
      
      df %>% 
        left_join(indexes, by = "customerid") %>% 
        mutate(days_diff = orderdate - indexdate,
               int120 = floor(days_diff/120)) %>% 
        group_by(customerid,int120) %>% 
        mutate(order_in_int120 = row_number()) %>% 
        ungroup() %>% 
        pivot_wider(id_cols = c("customerid","int120"),
                    names_from = order_in_int120,
                    values_from = c("orderid","orderdate"))