Search code examples
rtidyverselubridate

Divide groups in other groups by date intervals


I'm dealing with dates and I wanted to group some rows together but I can't find how. In my data, one row is an individual in a time interval and in a place. Something like that :

     ind place start_date end_date  
   <int> <int> <date>     <date>    
 1     1     1 2011-01-04 2011-01-05
 2     1     1 2011-01-05 2011-01-06
 3     1     1 2011-01-10 2011-01-11
 4     1     4 2010-12-30 2010-12-31
 5     1     4 2010-12-31 2011-01-01
 6     1     4 2011-01-01 2011-01-03
 7     2     2 2018-02-17 2018-02-19
 8     2     2 2018-02-19 2018-02-23
 9     3     3 2018-02-08 2018-02-13
10     3     3 2018-02-13 2018-02-16
11     3     3 2018-02-16 2018-02-20
12     3     3 2018-03-27 2018-03-29

As you can see, sometimes, start_date is the same as in previous end_date. I can group_by ind and place in order to merge time intervals, but the problem is that there are some rows where there is a gap between start_date and previous end_date. So I need to divide these groups in multiple other groups before merging.

What I want to obtain first is :

     ind place start_date end_date   group
   <int> <int> <date>     <date>     <int>
 1     1     1 2011-01-04 2011-01-05   1
 2     1     1 2011-01-05 2011-01-06   1
 3     1     1 2011-01-10 2011-01-11   2
 4     1     4 2010-12-30 2010-12-31   3
 5     1     4 2010-12-31 2011-01-01   3
 6     1     4 2011-01-01 2011-01-03   3
 7     2     2 2018-02-17 2018-02-19   4
 8     2     2 2018-02-19 2018-02-23   4
 9     3     3 2018-02-08 2018-02-13   5
10     3     3 2018-02-13 2018-02-16   5
11     3     3 2018-02-16 2018-02-20   5
12     3     3 2018-03-27 2018-03-29   6

And at the end :

     ind place start_date end_date  
    <chr> <chr> <date>     <date>    
 1     1     1 2011-01-04 2011-01-06
 2     1     1 2011-01-10 2011-01-11
 3     1     4 2010-12-30 2011-01-03
 4     2     2 2018-02-17 2018-02-23
 5     3     3 2018-02-08 2018-02-20
 6     3     3 2018-03-27 2018-03-29

Do you have an idea of how can I do it ?

Thanks !


Solution

  • One way is to add an additional grouping variable that checks if end_date is less then start_date

    library(dplyr)
    
    df %>% 
      group_by(grp = cumsum(lag(end_date, default=end_date[1]) < start_date),
               ind, place) %>% 
      summarize(start_date = first(start_date), 
                end_date = last(end_date), .groups="drop") %>% 
      select(-grp)
    # A tibble: 6 × 4
        ind place start_date end_date  
      <int> <int> <date>     <date>    
    1     1     1 2011-01-04 2011-01-06
    2     1     1 2011-01-10 2011-01-11
    3     1     4 2010-12-30 2011-01-03
    4     2     2 2018-02-17 2018-02-23
    5     3     3 2018-02-08 2018-02-20
    6     3     3 2018-03-27 2018-03-29