Search code examples
rdplyrlubridatedata-cleaningdata-munging

dplyr Collapsing time periods


I have a data.frame below and I want to "chunk" the time periods such that for each company_id it "collapses" the time periods to those that are separated by thirty days.

   company_id  startDate    endDate
1      209952 2012-09-17 2012-10-17
2      209952 2012-10-17 2012-11-17
3      209952 2012-11-17 2012-12-17
4      209952 2012-12-17 2013-01-17
5      209952 2013-01-17 2013-02-17
6      209952 2013-02-17 2013-03-17
7      209952 2013-03-17 2013-04-17
8      209952 2013-04-17 2013-05-17
9      209952 2013-05-17 2013-06-17
10     209952 2013-06-17 2013-07-17
11     209952 2013-07-17 2013-08-17
12     209952 2013-08-17 2013-09-17
13     209952 2013-09-17 2013-10-17
14     209952 2013-10-17 2013-11-17
15     209952 2013-11-17 2013-12-17
16     209952 2013-12-17 2014-01-17
17     209952 2014-01-17 2014-02-17
18     209952 2014-02-12 2014-03-12
19     209952 2014-03-12 2014-04-12
20     209952 2014-04-12 2014-05-12
21     209952 2014-05-12 2014-06-12
22     209952 2014-06-12 2014-07-12
23     209952 2014-07-12 2014-08-12
24     209952 2014-08-12 2014-09-12
25     209952 2014-09-12 2014-10-12
26     209952 2014-10-12 2014-11-12
27     209952 2014-11-12 2014-12-12
28     209952 2014-12-12 2015-01-12
29     209952 2015-01-12 2015-02-12
30     209952 2015-02-12 2015-03-12
31     209952 2015-03-12 2015-04-12
32     209952 2015-04-13 2015-05-13
33     209952 2015-05-07 2016-05-07
34     209952 2015-05-07 2015-06-07
35     209952 2015-06-07 2015-07-07
36     209952 2015-07-07 2015-08-07
37     209952 2015-08-07 2015-09-07
38     209952 2016-05-07 2017-10-23
39    2802315 2012-10-19 2012-11-19
40    2802315 2012-11-19 2012-12-19
41    2802315 2012-12-19 2013-01-19
42    2802315 2013-01-19 2013-02-19
43    2802315 2013-02-19 2013-03-19
44    2802315 2013-03-19 2013-04-19
45    2802315 2013-04-19 2013-05-19
46    2802315 2013-05-19 2013-06-19
47    2802315 2013-06-19 2013-07-19
48    2802315 2013-07-19 2013-08-19
49    2802315 2013-08-19 2013-09-19
50    2802315 2013-09-19 2013-10-19
51    2802315 2013-10-19 2013-11-19
52    2802315 2013-11-18 2013-12-18
53    2802315 2013-12-18 2014-01-18
54    2802315 2014-01-18 2014-02-18
55    2802315 2014-02-18 2014-03-18
56    2802315 2014-03-18 2014-04-18
57    2802315 2014-04-18 2014-05-18
58    2802315 2014-09-29 2014-10-29
59    2802315 2014-10-29 2014-11-29
60    2802315 2015-04-22 2015-05-22
61    2802315 2015-05-21 2015-06-21
62    2802315 2015-06-23 2015-09-23
63    2802315 2015-07-23 2015-08-23
64    2802315 2015-11-23 2016-05-23

I have tried the below:

test <- blocks %>%
  filter(company_id %in% c(209952, 2802315)) %>%
  arrange(company_id, startDate) %>%
  group_by(company_id) %>%
  mutate(
    week = cumsum(startDate - lag(endDate, default = 0) > 30)
  ) %>%
  group_by(company_id, week) %>%
  summarize(
    startDate = min(startDate),
    endDate = max(endDate)
  )

The problem is that the intervals on line (1) and (2) should be combined into one so startDate = 2012-09-17 and endDate = 2017-10-23 because there's less than thirty days in between.

  company_id  week  startDate    endDate
       <dbl> <int>     <date>     <date>
1     209952     1 2012-09-17 2016-05-07
2     209952     2 2016-05-07 2017-10-23
3    2802315     1 2012-10-19 2014-05-18
4    2802315     2 2014-09-29 2014-11-29
5    2802315     3 2015-04-22 2015-09-23
6    2802315     4 2015-11-23 2016-05-23

The output I'm looking for is

1     209952     1 2012-09-17 2016-10-23
2    2802315     1 2012-10-19 2014-05-18
3    2802315     2 2014-09-29 2014-11-29
4    2802315     3 2015-04-22 2015-09-23
5    2802315     4 2015-11-23 2016-05-23

Solution

  • How about calling your mutate + summarize twice:

    chunk = function(DF){
      DF %>%
        mutate(
          week = cumsum(startDate - lag(endDate, default = 0) > 30)
        ) %>%
        group_by(company_id, week) %>%
        summarize(
          startDate = min(startDate),
          endDate = max(endDate)
        )
    }
    
    blocks %>%
      arrange(company_id, startDate) %>%
      group_by(company_id) %>%
      chunk() %>%
      chunk()
    

    Result:

    # A tibble: 5 x 4
    # Groups:   company_id [?]
      company_id  week  startDate    endDate
           <int> <int>     <date>     <date>
    1     209952     1 2012-09-17 2017-10-23
    2    2802315     1 2012-10-19 2014-05-18
    3    2802315     2 2014-09-29 2014-11-29
    4    2802315     3 2015-04-22 2015-09-23
    5    2802315     4 2015-11-23 2016-05-23
    

    Data:

    blocks = structure(list(company_id = c(209952L, 209952L, 209952L, 209952L, 
    209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 
    209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 
    209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 
    209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 
    209952L, 209952L, 209952L, 209952L, 209952L, 209952L, 2802315L, 
    2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 
    2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 
    2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 2802315L, 
    2802315L, 2802315L, 2802315L, 2802315L), startDate = structure(c(15600, 
    15630, 15661, 15691, 15722, 15753, 15781, 15812, 15842, 15873, 
    15903, 15934, 15965, 15995, 16026, 16056, 16087, 16113, 16141, 
    16172, 16202, 16233, 16263, 16294, 16325, 16355, 16386, 16416, 
    16447, 16478, 16506, 16538, 16562, 16562, 16593, 16623, 16654, 
    16928, 15632, 15663, 15693, 15724, 15755, 15783, 15814, 15844, 
    15875, 15905, 15936, 15967, 15997, 16027, 16057, 16088, 16119, 
    16147, 16178, 16342, 16372, 16547, 16576, 16609, 16639, 16762
    ), class = "Date"), endDate = structure(c(15630, 15661, 15691, 
    15722, 15753, 15781, 15812, 15842, 15873, 15903, 15934, 15965, 
    15995, 16026, 16056, 16087, 16118, 16141, 16172, 16202, 16233, 
    16263, 16294, 16325, 16355, 16386, 16416, 16447, 16478, 16506, 
    16537, 16568, 16928, 16593, 16623, 16654, 16685, 17462, 15663, 
    15693, 15724, 15755, 15783, 15814, 15844, 15875, 15905, 15936, 
    15967, 15997, 16028, 16057, 16088, 16119, 16147, 16178, 16208, 
    16372, 16403, 16577, 16607, 16701, 16670, 16944), class = "Date")), class = "data.frame", .Names = c("company_id", 
    "startDate", "endDate"), row.names = c(NA, -64L))
    
    library(lubridate)
    blocks = blocks %>%
      mutate_if(is.character, ymd)