Search code examples
rdplyrfiltertimerange

Filter rows with time period that falls within another row timeperiod


I have a dataset that resembles the following minimal example:

   Category  date_in                date_out
   KA        2018-09-05            2018-10-08                   
   KA        2018-09-05            2018-09-18                         
   KA        2018-09-18            2018-09-24                    
   KB        2018-09-24            2018-10-08

Using dplyr, how can I eliminate, in each category, the rows with a period (date_in - date_out) that falls within the period of another row of the same category? In the example above, after filtering, we would have

   Category  date_in                date_out
   KA        2018-09-05            2018-10-08                                   
   KB        2018-09-24            2018-10-08

Solution

  • Within each group, take each row and check whether its start date is greater or equal to the start date of any other row in the group. Then check whether its end date is less than the end date of any other row in the group. If both these conditions are TRUE for this row compared to any of the other rows, then this row should be filtered out.

    A pedestrian approach to this would be:

    library(tidyverse)
    
    df %>%
      filter(df %>%
        mutate(across(contains('date'), as.Date)) %>%
        group_split(Category) %>%
        map(~ sapply(seq(nrow(.x)), function(i) {
                !any(.x$date_in[i] >= .x$date_in[-i] & 
                     .x$date_out[i] <= .x$date_out[-i])
          })) %>%
        unlist())
    #>   Category    date_in   date_out
    #> 1       KA 2018-09-05 2018-10-08
    #> 2       KB 2018-09-24 2018-10-08
    

    Data used in above example, taken from question and made reproducible

    df <- structure(list(Category = c("KA", "KA", "KA", "KB"), 
                         date_in = c("2018-09-05", "2018-09-05", "2018-09-18", 
                                     "2018-09-24"), 
                         date_out = c("2018-10-08", "2018-09-18", "2018-09-24", 
                                      "2018-10-08")), 
                    class = "data.frame", row.names = c(NA, -4L))