Search code examples
rdateconditional-statementsaggregate

R collapse date fields with conditions


I have a dataset that looks like this:

df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c("2015-01-29","2015-02-02","2015-02-02","2015-02-02","2014-05-04","2014-05-04","2014-05-04")),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2014-03-01","2013-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2014-08-31","2013-05-01"))); 
names(df)=c('id','poi','start','end')

> df
  id        poi      start        end
1  1 2015-01-29 2010-10-01 2016-04-30
2  2 2015-02-02 2009-09-01 2013-12-31
3  2 2015-02-02 2014-01-01 2014-01-31
4  2 2015-02-02 2014-02-01 2016-04-30
5  3 2014-05-04 2009-01-01 2014-02-28
6  4 2014-05-04 2014-03-01 2014-08-31
7  4 2014-05-04 2013-03-01 2013-05-01

The start and end dates are insurance begin and end dates, sometimes there are same start dates for multiple rows because they pertain to different insurance types. I am interested in retaining those IDs that has consistent insurance coverage one year before and after the poi. Each ID can only have 1 poi.

My output would be a list of IDs that have insurance coverage 1 year pre and 1 year post the poi. In this case it would exlude ids 3 and 4 becuase they dont have coverage 1 year after poi.

  ids=c(1,2)

I have tried the following, but honestly have no idea about how I can achieve what I want.

Any help will be really appreciated.

library(rehape2)
df.melt=melt(df,
             id=c("id","poi"))

df.melt=mutate(df.melt, flag=ave(id,id,variable,FUN=seq_along))
df.melt=mutate(df.melt, variable=paste(variable,flag,sep ="_"))
df.cast=dcast(df.melt, id+poi~variable)

Solution

  • If you want to evaluate the rows individually, with dplyr and lubridate:

    library(dplyr)
    library(lubridate)
    
    # filter to only rows with a POI within the desired range
    df %>% filter(poi - years(1) >= start, 
                  poi + years(1) <= end)
    
    #   id        poi      start        end
    # 1  1 2015-01-29 2010-10-01 2016-04-30
    # 2  2 2015-02-02 2014-02-01 2016-04-30
    

    If you'd rather evaluate all the rows for one ID, maybe something like

    # group to summarize IDs separately
    df %>% group_by(id, poi) %>% 
        # collapse rows to min start and max end for each ID
        summarise(start = min(start), 
                  end = max(end)) %>% 
        # filter to only rows with a POI within the desired range
        filter(poi - years(1) >= start, 
               poi + years(1) <= end)
    
    # Source: local data frame [2 x 4]
    # Groups: id [2]
    # 
    #      id        poi      start        end
    #   (dbl)     (date)     (date)     (date)
    # 1     1 2015-01-29 2010-10-01 2016-04-30
    # 2     2 2015-02-02 2009-09-01 2016-04-30
    

    Such an approach would overlook gaps in coverage, though, if that's a possibility. If so, lubridate::interval and int_overlaps may be useful in condensing rows carefully.