Search code examples
rdataframedatetimeaggregateperiod

collapse consecutive periods if the same


I tried to express my use case in R code as follows:

haves <- data.frame(
          id = c(1,1,1,1)
        , start = c(as.Date('2022-01-01'), as.Date('2022-02-01'), as.Date('2022-02-01'), as.Date('2022-02-16'))
        , end = c(as.Date('2022-01-31'), as.Date('2022-02-28'), as.Date('2022-02-15'), as.Date('2022-02-28'))
        , category = c("A", "B", "A", "A")
    )

haves

enter image description here

wants <- data.frame(
          id = c(1,1,1)
        , start = c(as.Date('2022-01-01'), as.Date('2022-02-01'), as.Date('2022-02-01'))
        , end = c(as.Date('2022-01-31'), as.Date('2022-02-28'), as.Date('2022-02-28'))
        , category = c("A", "B", "A")
    )

wants

enter image description here

Basically I want to collapse time periods (see last 2 columns in wants) if they are consecutive and contain the same category but also depended on id. Do you think this is possible? A group by id and category and then using min and max on start and end would not work.


Solution

  • Using rle, this should work:

    r <- rle(haves$category)$l
    haves %>% 
      mutate(cons = rep(seq(r), r)) %>% 
      group_by(id, category, cons) %>% 
      summarise(start = min(start),
                end = max(end))
    
    # A tibble: 3 x 5
    # Groups:   id, category [2]
         id category  cons start      end       
      <dbl> <chr>    <int> <date>     <date>    
    1     1 A            1 2022-01-01 2022-01-31
    2     1 A            3 2022-02-01 2022-02-28
    3     1 B            2 2022-02-01 2022-02-28
    

    Or with cons = data.table::rleid(category).