Search code examples
rdatedplyrdata.tablelubridate

Number of active months based on grouped time intervals


I have a table with individuals that perform activities during time intervals (bounded by start and end dates). Some of them have several activities, some others don’t. Sometimes the activities overlap, and sometimes they are separated by a period of inactivity.

I would like to group by individuals and compute the number of months they have been active, no matter the activity.

Here is a small reproducible example:

df <- data.frame(id=c("x", "y", "y", "z", "z"),
             activity=c("a1", "b1", "b2", "c1", "c2"),
             start=c(as.Date("2017-07-01"), 
                     as.Date("2018-04-01"), as.Date("2018-07-01"), 
                     as.Date("2017-07-01"), as.Date("2018-02-01")),
             end=c(as.Date("2018-07-31"), 
                   as.Date("2018-05-31"), as.Date("2018-07-31"),
                   as.Date("2018-02-28"), as.Date("2018-07-31")))

It would look like this on a Gantt chart: enter image description here

Does anyone could think of a data.table or dplyr solution to get the following output?

id | active_months
x  |      13
y  |      3
z  |      13

Solution

  • We can create monthly sequence between start and end date and count unique month values for each id.

    library(dplyr)
    
    df %>%
      group_by(id) %>%
      summarise(months = n_distinct(unlist(purrr::map2(start, end, 
                          ~seq(.x, .y, by = 'month')))))
    
    #   id    months
    #  <chr>  <int>
    #1 x         13
    #2 y          3
    #3 z         13