Search code examples
rdata.tableevent-log

Merging values of two rows based on conditions using data.table in R


In a dataset in which I have an activity, start and stop times, and an id, I want to merge values of two rows in the same column and update other columns if several conditions apply. First a data example:

library(data.table)
DT <- data.table(person=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
             activity=c("grab", "walk", "remove", "delete", "run", "talk", "walk", "remove",
                        "grab", "walk", "delete", "talk", "remove"),
             start_time=c(0,1,3,6,0,2,2,3,3,3,6,6,7), stop_time=c(1,3,5,7,1,4,4,8,4,5,7,7,8))
 DT

I want to update start and stop times and merge column 'activity' for each person if:

  • activities are conducted in parallel. Specifically, if the start_time of the following activity is before the stop_time of the preceeding activity activity for that person. Or:
  • If either the start or stop times of activities by one person are identical.

The updated row should reflect the start and stop times of the combined activity, and all rows, except the one updated should be removed. Below is the goal I want to achieve with the data sample I provided:

DT.goal <- data.table(person=c(1,1,2,2,3,3),
                  activity=c("grab + walk + remove", "delete", "run", "talk + walk + grab + remove",
                             "walk", "delete + talk + remove"),
                  start_time=c(0,6,0,2,3,6), stop_time=c(5,7,1,8,5,8))
DT.goal

So far I've come up with the following incomplete attempt:

DT.test <- DT[start_time <= shift(stop_time, 1L, type="lag"), 
          cond := T, by=person]
DT.test <- DT.test[cond==T, 
          new_activity := paste(activity, shift(activity, 1L, type="lag")), by=person]
DT.test <- DT.test[, new_start := start_time, by=person][cond==T, new_start := min(start_time), by=person]
DT.test <- DT.test[, new_stop := stop_time, by=person][cond==T, new_stop := max(stop_time), by=person]

However, using the shift(, type="lag) is not very useful for the first row for each person, as it has no previous row to look at. Also, paste() pastes NA if the condition does not evaluate to TRUE.

Can anyone help me on my way?


Solution

  • Please check the following:

    library(data.table)
    DT <- data.table(person=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
                     activity=c("grab", "walk", "remove", "delete", "run", "talk", "walk", "remove",
                                "grab", "walk", "delete", "talk", "remove"),
                     start_time=c(0,1,3,6,0,2,2,3,3,3,6,6,7), stop_time=c(1,3,5,7,1,4,4,8,4,5,7,7,8))
    
    setorder(DT, person, start_time)
    DT[, concatenate := start_time %in% stop_time | stop_time %in% start_time | duplicated(start_time) | duplicated(start_time, fromLast=TRUE), by = "person"]
    DT[, concatenate_grp := rleid(concatenate), by = "person"]
    DT[, paste(activity, collapse = " + "), by = c("person", "concatenate_grp")]
    DT.goal <- DT[, .(activity = paste(activity, collapse = " + "), start_time = min(start_time), stop_time = max(stop_time)), by = c("person", "concatenate_grp")][, concatenate_grp := NULL]
    

    Which results in:

       person                    activity start_time stop_time
    1:      1        grab + walk + remove          0         5
    2:      1                      delete          6         7
    3:      2                         run          0         1
    4:      2 talk + walk + remove + grab          2         8
    5:      3                        walk          3         5
    6:      3      delete + talk + remove          6         8