I have a dataset that resembles the following minimal example:
Category date_in date_out
KA 2018-09-05 2018-10-08
KA 2018-09-05 2018-09-18
KA 2018-09-18 2018-09-24
KB 2018-09-24 2018-10-08
Using dplyr, how can I eliminate, in each category, the rows with a period (date_in - date_out) that falls within the period of another row of the same category? In the example above, after filtering, we would have
Category date_in date_out
KA 2018-09-05 2018-10-08
KB 2018-09-24 2018-10-08
Within each group, take each row and check whether its start date is greater or equal to the start date of any other row in the group. Then check whether its end date is less than the end date of any other row in the group. If both these conditions are TRUE
for this row compared to any
of the other rows, then this row should be filtered out.
A pedestrian approach to this would be:
library(tidyverse)
df %>%
filter(df %>%
mutate(across(contains('date'), as.Date)) %>%
group_split(Category) %>%
map(~ sapply(seq(nrow(.x)), function(i) {
!any(.x$date_in[i] >= .x$date_in[-i] &
.x$date_out[i] <= .x$date_out[-i])
})) %>%
unlist())
#> Category date_in date_out
#> 1 KA 2018-09-05 2018-10-08
#> 2 KB 2018-09-24 2018-10-08
Data used in above example, taken from question and made reproducible
df <- structure(list(Category = c("KA", "KA", "KA", "KB"),
date_in = c("2018-09-05", "2018-09-05", "2018-09-18",
"2018-09-24"),
date_out = c("2018-10-08", "2018-09-18", "2018-09-24",
"2018-10-08")),
class = "data.frame", row.names = c(NA, -4L))