Search code examples
rdate-rangeoverlapping-matches

How to obtain overlapping values for intervals of dates in R


I have a data frame that looks like this:

w<-read.table(header=TRUE, text="
start.date   end.date   manager
2006-05-01   2007-04-30 a
2006-09-30   2007-12-31 b
1999-09-30   2007-12-31 c
2008-01-01   2012-04-30 d
2008-01-01   2020-02-28 e
2009-05-01   2016-04-08 f")

I'd like to obtain a dataframe which returns which managers were working during each month in the period, so for example

df<-read.table(header=TRUE, text="
month    manager1  manager2  manager3  manager4
01-2006  a         b         c         NA
02-2006  a         b         c         d
03-2006  b         c         d         NA
04-2006  b         d         NA        NA")

I started by defining a function datseq that returns the months between the start.date and the end.date

datseq <- function(t1, t2) { 
  format(seq.Date(from = as.Date(t1,"%Y-%m-%d"), 
             to = as.Date(t2,"%Y-%m-%d"),by="month"), 
         "%m/%Y") 

but then I cannot create a proper loop to obtain the desired result. Thank you in advance to everyone replying!


Solution

  • Since you need to know only the overlap at the month level and not the day level, you can consider that managers have started at day 1 and left at last day of month. This can be achieved using floor_date and ceiling_date from the package lubridate.

    library(lubridate)
    
    w.extended <- w
    
    w.extended$start.date <- floor_date(as.Date(w.extended$start.date), "month")
    w.extended$end.date <- ceiling_date(as.Date(w.extended$end.date), "month") - 1
    
    #List of months
    timeperiod <- seq(min(w.extended$start.date),
                      by = "month", 
                      to = max(w.extended$end.date))
    

    You can then use a %within% b from the package lubridate which can check if a date falls within a list of intervals. Apply this function to each of your months with the intervals you provided.

    df <- data.frame(t(sapply(timeperiod, 
                              function(x){
                                managersWorking <- x %within% interval(w.extended$start.date, 
                                                                       w.extended$end.date)
                                c(as.character(x), managersWorking)
                                })),
                     stringsAsFactors = F)
    
    #Replace the 'character' format of columns to the appropriate one
    df[-1] <- apply(df[-1], 2, as.logical)
    df[,1]<- format(as.Date(df[,1]), "%Y/%m")
    
    colnames(df) <- c("month", paste0("manager.", w$manager))
    
    head(df)
    #    month manager.a manager.b manager.c manager.d manager.e manager.f
    #1 1999/09     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    #2 1999/10     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    #3 1999/11     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    #4 1999/12     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    #5 2000/01     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    #6 2000/02     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE
    

    Raw data:

    w <- read.table(header=TRUE, text="
    start.date   end.date   manager
    2006-05-01   2007-04-30 a
    2006-09-30   2007-12-31 b
    1999-09-30   2007-12-31 c
    2008-01-01   2012-04-30 d
    2008-01-01   2020-02-28 e
    2009-05-01   2016-04-08 f")
    w
    #  start.date   end.date manager
    #1 2006-05-01 2007-04-30       a
    #2 2006-09-30 2007-12-31       b
    #3 1999-09-30 2007-12-31       c
    #4 2008-01-01 2012-04-30       d
    #5 2008-01-01 2020-02-28       e
    #6 2009-05-01 2016-04-08       f