I would like to be able to take two variables representing a starting date and an ending date and create variables indicating which years are covered over the range of those two dates.
What I have:
df1 <- data.frame(ID = c("A", "B", "C"),
Start_Date = c("3/5/2004", "8/22/2005", "4/8/2008"),
End_Date = c("6/25/2009","11/2/2006", "6/9/2011"))
What I want:
df2 <- data.frame(ID = c("A", "B", "C"),
Start_Date = c("3/5/2004", "8/22/2005", "4/8/2008"),
End_Date = c("6/25/2009","11/2/2006", "6/9/2011"),
y2004 = c(1, 0, 0),
y2005 = c(1, 1, 0),
y2006 = c(1, 1, 0),
y2007 = c(1, 0, 0),
y2008 = c(1, 0, 1),
y2009 = c(0, 0, 1),
y2010 = c(0, 0, 1),
y2011 = c(0, 0, 1))
As above, each new year variable indicates whether or not the year is captured in the range of the two date variables "Start_Date" and "End_Date".
Any ideas would be greatly appreciated. Thanks in advance.
One method is to pivot to 'long' format, extract the year
part after converting to Date
class, then get the seq
(:
) from the first to last grouped by 'ID' and reshape back to 'wide', then join with the original data by 'ID'
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
df1 %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(year = str_c('y', year(mdy(value)[1]):year(mdy(value)[2])),
n = 1, .groups = 'drop') %>%
pivot_wider(names_from = year, values_from = n, values_fill = 0) %>%
left_join(df1, .)
-output
# ID Start_Date End_Date y2004 y2005 y2006 y2007 y2008 y2009 y2010 y2011
#1 A 3/5/2004 6/25/2009 1 1 1 1 1 1 0 0
#2 B 8/22/2005 11/2/2006 0 1 1 0 0 0 0 0
#3 C 4/8/2008 6/9/2011 0 0 0 0 1 1 1 1