Search code examples
rdplyrdate-range

Create time range per group/location based on data set with movements


I have a data set that contains movements of groups between different locations, which looks like this:

library(data.table)
df = data.table(Date = c('01/01/2021', '06/01/2021', '09/01/2021', '10/01/2021', '20/01/2021', '24/01/2021'),
                Group = c('Group A', 'Group A', 'Group B', 'Group B', 'Group B', 'Group A'),
                From = c('NA', 1, 3, 4, 5, 4),
                To = c(1, 'NA', 4, 5, 'NA', 1))

Date        Group    From  To
01/01/2021  Group A  NA    1
06/01/2021  Group A  1     NA
09/01/2021  Group B  3     4
10/01/2021  Group B  4     5
20/01/2021  Group B  5     NA
24/01/2021  Group A  4     1

Now, I would like to create a data table for each group/location combination the time range that was spent at the specific location. In the case the "From" column contains NA, this group just entered the movement process. When the "To" column contains NA, this group exited the movement process. The desired data would then look like:

Group   Date_entry  Date_exit   Location
Group A 01/01/2021  06/01/2021  1
Group B 09/01/2021  10/01/2021  4
Group B 10/01/2021  20/01/2021  5
Group A 24/01/2021  NA          1

Solution

  • I recommend an approach like the following:

    df %>%
      inner_join(
        df, 
        by = c("Group" = "Group", "To" = "From"),
        suffix = c("_F", "_T")
      ) %>%
      select(Group, Date_F, Date_T, Location = To) %>%
      filter(Location != "NA") %>%
      mutate(Date_T = ifelse(Date_F < Date_T, Date_T, NA)) %>%
      group_by(Group, Location, Date_F) %>%
      summarise(Date_T = min(Date_T), .groups = "drop")
    

    This may be a more general approach than you need - it is designed to handle many different rows per Group-Location pair.

    How it works:

    1. Join df to itself, so we have records for all date pairs where a group enters and leaves a location. Note that these dates may be out of order, and dates may be repeated.
    2. Discard non-locations
    3. Replace end-dates that are before start-dates with NA
    4. Find the minimum remaining end-date (that is after the start-date)