
Split date range into several chunks ending by YYYY-12-31


Let's say I have the following df:

df <- data.frame(group = c("a", "a", "b", "b"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"))

  group      start        end
1     a 2017-05-01 2018-09-01
2     a 2019-04-03 2020-04-03
3     b 2011-03-03 2012-05-03
4     b 2014-05-07 2016-04-02

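I want to split each record at the year boundaries: the first piece runs from the start date to 31 December of that year, each following piece covers one calendar year from 1 January, and the last piece ends at the original end date: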

  group      start        end
1     a 2017-05-01 2017-12-31
2     a 2018-01-01 2018-09-01
3     a 2019-04-03 2019-12-31
4     a 2020-01-01 2020-04-03
5     b 2011-03-03 2011-12-31
6     b 2012-01-01 2012-05-03
7     b 2014-05-07 2014-12-31
8     b 2015-01-01 2015-12-31
9     b 2016-01-01 2016-04-02

Any ideas on how to tackle this?

Edit:

Date ranges that fall within a single year are not my primary concern. However, as chinsoon12 pointed out, it would be helpful if the approach handled them as well, for instance in this dataset:

df <- data.frame(group = c("a", "a", "b", "b", "c"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05")) 

The expected result leaves the last row unchanged:

   group      start        end
1      a 2017-05-01 2017-12-31
2      a 2018-01-01 2018-09-01
3      a 2019-04-03 2019-12-31
4      a 2020-01-01 2020-04-03
5      b 2011-03-03 2011-12-31
6      b 2012-01-01 2012-05-03
7      b 2014-05-07 2014-12-31
8      b 2015-01-01 2015-12-31
9      b 2016-01-01 2016-04-02
10     c 2017-02-01 2017-04-05  

Solution

  • A possible solution with data.table:

    library(data.table)
    setDT(df)
    
    df[df[, rep(.I, 1 + year(end) - year(start))]
       ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
                 end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
         , by = .(group, rleid(start))][]
    

    which gives:

        group      start        end
     1:     a 2017-05-01 2017-12-31
     2:     a 2018-01-01 2018-09-01
     3:     a 2019-04-03 2019-12-31
     4:     a 2020-01-01 2020-04-03
     5:     b 2011-03-03 2011-12-31
     6:     b 2012-01-01 2012-05-03
     7:     b 2014-05-07 2014-12-31
     8:     b 2015-01-01 2015-12-31
     9:     b 2016-01-01 2016-04-02
    10:     c 2017-02-01 2017-04-05
    

    Two alternative solutions with data.table:

    # alternative 1:
    df[, ri := rowid(group)
       ][df[, rep(.I, 1 + year(end) - year(start))]
         ][, `:=` (start = if (.N == 1) start else c(start[1], as.Date(paste0(year(start[1]) + 1:(.N-1), '-01-01') )),
                   end = if (.N == 1) end else c(as.Date(paste0(year(end[.N]) - (.N-1):1, '-12-31') ), end[.N]))
           , by = .(group, ri)][, ri := NULL][]
    
    # alternative 2:
    df[, ri := rowid(group)
       ][df[, rep(.I, 1 + year(end) - year(start))]
         ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
                   end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
           , by = .(group, ri)][, ri := NULL][]
    

    Used data (with start and end converted to Date):

    df <- data.frame(group = c("a", "a", "b", "b", "c"),
                     start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
                     end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05")) 
    df[2:3] <- lapply(df[2:3], as.Date)
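
For readers who want to see the clamping idea from the solutions above (pmax on the start dates, pmin on the end dates) without data.table, here is a minimal base R sketch. The helper name split_by_year is hypothetical and not part of the original answer, and the sketch assumes start <= end in every row.

```r
# Hypothetical helper: split one [start, end] range at calendar-year boundaries.
# Assumes start and end are length-1 Date values with start <= end.
split_by_year <- function(start, end) {
  years <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
  jan1  <- as.Date(paste0(years, "-01-01"))
  dec31 <- as.Date(paste0(years, "-12-31"))
  # Clamp each calendar year to the original range.
  data.frame(start = pmax(start, jan1), end = pmin(end, dec31))
}

df <- data.frame(group = c("a", "a", "b", "b", "c"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
df[2:3] <- lapply(df[2:3], as.Date)

# Split every row, keep its group label, and stack the pieces.
pieces <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(group = df$group[i], split_by_year(df$start[i], df$end[i]))
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because a range inside a single year produces a one-element years vector, the last row (group c) passes through unchanged, so this should reproduce the ten rows shown in the expected result. It loops over rows, so for large tables the data.table versions are likely the better choice.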