I have a dataset that looks like this:
df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c("2015-01-29","2015-02-02","2015-02-02","2015-02-02","2014-05-04","2014-05-04","2014-05-04")),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2014-03-01","2013-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2014-08-31","2013-05-01")));
names(df)=c('id','poi','start','end')
> df
id poi start end
1 1 2015-01-29 2010-10-01 2016-04-30
2 2 2015-02-02 2009-09-01 2013-12-31
3 2 2015-02-02 2014-01-01 2014-01-31
4 2 2015-02-02 2014-02-01 2016-04-30
5 3 2014-05-04 2009-01-01 2014-02-28
6 4 2014-05-04 2014-03-01 2014-08-31
7 4 2014-05-04 2013-03-01 2013-05-01
The start and end dates are insurance begin and end dates, sometimes there are same start dates for multiple rows because they pertain to different insurance types. I am interested in retaining those IDs that has consistent insurance coverage one year before and after the poi. Each ID can only have 1 poi.
My output would be a list of IDs that have insurance coverage 1 year pre and 1 year post the poi. In this case it would exlude ids 3 and 4 becuase they dont have coverage 1 year after poi.
ids=c(1,2)
I have tried the following, but honestly have no idea about how I can achieve what I want.
Any help will be really appreciated.
library(rehape2)
df.melt=melt(df,
id=c("id","poi"))
df.melt=mutate(df.melt, flag=ave(id,id,variable,FUN=seq_along))
df.melt=mutate(df.melt, variable=paste(variable,flag,sep ="_"))
df.cast=dcast(df.melt, id+poi~variable)
If you want to evaluate the rows individually, with dplyr
and lubridate
:
library(dplyr)
library(lubridate)
# filter to only rows with a POI within the desired range
df %>% filter(poi - years(1) >= start,
poi + years(1) <= end)
# id poi start end
# 1 1 2015-01-29 2010-10-01 2016-04-30
# 2 2 2015-02-02 2014-02-01 2016-04-30
If you'd rather evaluate all the rows for one ID, maybe something like
# group to summarize IDs separately
df %>% group_by(id, poi) %>%
# collapse rows to min start and max end for each ID
summarise(start = min(start),
end = max(end)) %>%
# filter to only rows with a POI within the desired range
filter(poi - years(1) >= start,
poi + years(1) <= end)
# Source: local data frame [2 x 4]
# Groups: id [2]
#
# id poi start end
# (dbl) (date) (date) (date)
# 1 1 2015-01-29 2010-10-01 2016-04-30
# 2 2 2015-02-02 2009-09-01 2016-04-30
Such an approach would overlook gaps in coverage, though, if that's a possibility. If so, lubridate::interval
and int_overlaps
may be useful in condensing rows carefully.