df <- data.frame(group = c("a", "a", "b", "b"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"))
Let's say I have the following df:
group start end
1 a 2017-05-01 2018-09-01
2 a 2019-04-03 2020-04-03
3 b 2011-03-03 2012-05-03
4 b 2014-05-07 2016-04-02
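I want to get it into this format, where each record is split at the year boundaries: the first piece runs from the start date to 31 December of that year, each subsequent year gets its own row starting on 1 January, and the last piece ends at the original end date: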
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
Any ideas on how to tackle this?
Edit:
Date ranges that fall within a single year are not my primary concern. However, as chinsoon12 pointed out, it would be helpful if the approach handled them as well, as in this dataset:
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
The end result would leave the last row unchanged:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
A possible solution with data.table:
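library(data.table)
# start and end must be of class Date (see "Used data" below)
setDT(df)
# repeat each row once per calendar year it touches, then clamp the
# 1 January / 31 December boundaries to the original start and end dates
df[df[, rep(.I, 1 + year(end) - year(start))]
   ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
             end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
     , by = .(group, rleid(start))][]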
which gives:
    group      start        end
 1:     a 2017-05-01 2017-12-31
 2:     a 2018-01-01 2018-09-01
 3:     a 2019-04-03 2019-12-31
 4:     a 2020-01-01 2020-04-03
 5:     b 2011-03-03 2011-12-31
 6:     b 2012-01-01 2012-05-03
 7:     b 2014-05-07 2014-12-31
 8:     b 2015-01-01 2015-12-31
 9:     b 2016-01-01 2016-04-02
10:     c 2017-02-01 2017-04-05
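To see why this works, it may help to look at the two steps separately. The sketch below rebuilds the example data with Date columns, shows which rows get repeated, and then applies the pmax/pmin clamping by hand to the fourth input row (2014-05-07 to 2016-04-02). The names n_years, idx, expanded, s, e, n, piece_start and piece_end are illustrative only and do not appear in the code above.

```r
library(data.table)

# Example data with start/end already stored as Date
df <- data.table(
  group = c("a", "a", "b", "b", "c"),
  start = as.Date(c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01")),
  end   = as.Date(c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
)

# Step 1: how many calendar years does each range touch?
n_years <- df[, 1 + year(end) - year(start)]
n_years
# 2 2 2 3 1

# Repeat each row index once per calendar year and subset with it
idx <- df[, rep(.I, 1 + year(end) - year(start))]
idx
# 1 1 2 2 3 3 4 4 4 5
expanded <- df[idx]

# Step 2, by hand for the fourth input row (three calendar years)
s <- as.Date("2014-05-07")
e <- as.Date("2016-04-02")
n <- 3

# 1 January of each year, but never earlier than the original start
piece_start <- pmax(s, as.Date(paste0(year(s) + 0:(n - 1), "-01-01")))
# 31 December of each year, but never later than the original end
piece_end <- pmin(e, as.Date(paste0(year(e) - (n - 1):0, "-12-31")))

piece_start
# "2014-05-07" "2015-01-01" "2016-01-01"
piece_end
# "2014-12-31" "2015-12-31" "2016-04-02"
```

A range within a single year has n equal to 1, so pmax and pmin simply return the original start and end, which is why the last row of the example is left unchanged.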
Two alternative solutions with data.table:
# alternative 1:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = if (.N == 1) start else c(start[1], as.Date(paste0(year(start[1]) + 1:(.N-1), '-01-01') )),
end = if (.N == 1) end else c(as.Date(paste0(year(end[.N]) - (.N-1):1, '-12-31') ), end[.N]))
, by = .(group, ri)][, ri := NULL][]
# alternative 2:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
, by = .(group, ri)][, ri := NULL][]
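The alternatives differ from the first solution mainly in the grouping id: they group by a per-row counter built with rowid(group) instead of rleid(start). A small sketch of what the two ids look like for the example data follows; grp, starts and idx are illustrative names, with idx being the repeated row index from the expansion step. The remark about identical start dates is an observation about how rleid behaves, not something stated in the original answer.

```r
library(data.table)

grp <- c("a", "a", "b", "b", "c")
starts <- as.Date(c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"))

# rowid() numbers the rows within each group, so (group, ri) identifies
# every input row uniquely before the rows are repeated
rowid(grp)
# 1 2 1 2 1

# rleid() is computed on the already expanded rows: consecutive rows with
# the same start date share one id
idx <- c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L)
rleid(starts[idx])
# 1 1 2 2 3 3 4 4 4 5

# Two different input rows with the same start date would get the same
# run id, which is a case where the rowid() variant may be the safer choice
rleid(as.Date(c("2017-05-01", "2017-05-01")))
# 1 1
```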
Used data:
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
# convert the start and end columns from character to Date
df[2:3] <- lapply(df[2:3], as.Date)
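As a sanity check, one could verify that the split loses or duplicates no days. The sketch below is an illustration rather than part of the original answer: it reruns the first solution on the data above and checks that every resulting row stays within one calendar year and that the inclusive day count per group is unchanged. The names res, orig_days and new_days are hypothetical.

```r
library(data.table)

df <- data.frame(group = c("a", "a", "b", "b", "c"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
df[2:3] <- lapply(df[2:3], as.Date)
setDT(df)

# First solution from above, stored in res; df itself is not modified
# because the row subset creates a copy
res <- df[df[, rep(.I, 1 + year(end) - year(start))]
          ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
                    end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
            , by = .(group, rleid(start))]

# Every piece must start and end in the same calendar year
stopifnot(res[, all(year(start) == year(end))])

# Inclusive number of days per group, before and after the split
orig_days <- df[, .(days = sum(as.integer(end - start) + 1L)), by = group]
new_days <- res[, .(days = sum(as.integer(end - start) + 1L)), by = group]
stopifnot(identical(orig_days$days, new_days$days))
```

If either stopifnot call fails, the pieces overlap, leave a gap, or cross a year boundary.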