Search code examples
rdataframeaggregate

Combining rows in a data.frame


I have a data.frame that looks like this example data.frame:

df = data.frame(id = c("a","b","c","d","e","f","g"), start = c(10,20,30,40,50,60,70), end = c(15,25,35,45,55,65,75), flag = c(0,1,1,0,1,0,1))
> df
  id start end flag
1  a    10  15    0
2  b    20  25    1
3  c    30  35    1
4  d    40  45    0
5  e    50  55    1
6  f    60  65    0
7  g    70  75    1

It is sorted in ascending order by df$start and then df$end. I'm looking for a function that will merge all rows with df$flag == 1 that are not intervened by rows with df$flag == 0. Merged rows should have their df$id concatenated with a coma separator, their df$start should be that of the row with the min df$start and their df$endshould be that of the row with the max df$end. And finally, df$flag should be 1 for them.

So for this example the return data.frame should be:

res.df = data.frame(id = c("a","b,c","d","e","f","g"), start = c(10,20,40,50,60,70), end = c(15,35,45,55,65,75), flag = c(0,1,0,1,0,1))
> res.df
   id start end flag
1   a    10  15    0
2 b,c    20  35    1
3   d    40  45    0
4   e    50  55    1
5   f    60  65    0
6   g    70  75    1

Solution

  • This is difficult to do with aggregate as the same function gets applied to each column in turn. Multiple calls to aggregate could accomplish this in pieces which are then merged. However, data.table allows for different aggregation functions in the same call:

    library(data.table)
    d <- data.table(df)
    
    d[,list(id=paste(id, collapse=','), start=min(start), end=max(end)), 
      by=list(flag, cumsum(flag==0))
    ]
       flag cumsum  id start end
    1:    0      1   a    10  15
    2:    1      1 b,c    20  35
    3:    0      2   d    40  45
    4:    1      2   e    50  55
    5:    0      3   f    60  65
    6:    1      3   g    70  75
    

    You do end up with a new column which may be removed if desired. The condition on cumsum(flag==0) prevents any rows with flag==0 from being combined in the aggregate, and keeps the non-consecutive flag==1 values separate in the result.