I have a data.frame
that looks like this example data.frame
:
df = data.frame(id = c("a","b","c","d","e","f","g"), start = c(10,20,30,40,50,60,70), end = c(15,25,35,45,55,65,75), flag = c(0,1,1,0,1,0,1))
> df
id start end flag
1 a 10 15 0
2 b 20 25 1
3 c 30 35 1
4 d 40 45 0
5 e 50 55 1
6 f 60 65 0
7 g 70 75 1
It is sorted in ascending order by df$start
and then df$end
.
I'm looking for a function that will merge all rows with df$flag == 1
that are not intervened by rows with df$flag == 0
. Merged rows should have their df$id
concatenated with a coma separator, their df$start
should be that of the row with the min
df$start
and their df$end
should be that of the row with the max
df$end
. And finally, df$flag
should be 1 for them.
So for this example the return data.frame
should be:
res.df = data.frame(id = c("a","b,c","d","e","f","g"), start = c(10,20,40,50,60,70), end = c(15,35,45,55,65,75), flag = c(0,1,0,1,0,1))
> res.df
id start end flag
1 a 10 15 0
2 b,c 20 35 1
3 d 40 45 0
4 e 50 55 1
5 f 60 65 0
6 g 70 75 1
This is difficult to do with aggregate
as the same function gets applied to each column in turn. Multiple calls to aggregate
could accomplish this in pieces which are then merged. However, data.table
allows for different aggregation functions in the same call:
library(data.table)
d <- data.table(df)
d[,list(id=paste(id, collapse=','), start=min(start), end=max(end)),
by=list(flag, cumsum(flag==0))
]
flag cumsum id start end
1: 0 1 a 10 15
2: 1 1 b,c 20 35
3: 0 2 d 40 45
4: 1 2 e 50 55
5: 0 3 f 60 65
6: 1 3 g 70 75
You do end up with a new column which may be removed if desired. The condition on cumsum(flag==0)
prevents any rows with flag==0
from being combined in the aggregate, and keeps the non-consecutive flag==1
values separate in the result.