
Grouping linear intervals by distance cutoff


I have an R data.frame of linear intervals:

df <- data.frame(id = paste0("i",1:15),
                 start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
                 end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))

I also have an inter-interval distance cutoff:

inter.interval.distance.cutoff <- 3243

df is sorted by start and end. The first interval belongs to group g1. From there on, each interval is compared with the interval preceding it, where the distance between the two is defined as the start of the current interval minus the end of the preceding one. If that distance is less than or equal to inter.interval.distance.cutoff, the interval is assigned to the group of the preceding interval; otherwise it starts a new group (the group index is incremented by 1, which gives the new group label).
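
To make the rule concrete, here is a literal, loop-based reading of it. This is only a reference sketch, not a fast solution; `group_by_gap_loop` is a hypothetical helper name, and it assumes the intervals are already sorted.

```r
# Loop-based reference version of the grouping rule (assumes sorted input).
# `group_by_gap_loop` is a hypothetical name used for illustration only.
group_by_gap_loop <- function(start, end, cutoff) {
  n <- length(start)
  if (n == 0) return(character(0))
  idx <- integer(n)
  idx[1] <- 1L                          # the first interval always opens g1
  for (i in seq_len(n)[-1]) {
    gap <- start[i] - end[i - 1]        # start of current minus end of preceding
    idx[i] <- if (gap <= cutoff) idx[i - 1] else idx[i - 1] + 1L
  }
  paste0("g", idx)
}

# First four intervals of df: the gap before i4 is 45835 - 8260 = 37575 > 3243,
# so i4 starts g2.
res_loop <- group_by_gap_loop(c(6575, 7156, 7949, 45835),
                              c(6928, 7392, 8260, 46104),
                              3243)
res_loop
# [1] "g1" "g1" "g1" "g2"
```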

Here's my desired outcome:

df$group <- c(rep("g1",3), rep("g2",3), rep("g3",3), rep("g4",3), rep("g5",3))

Is there a fast way to do this?


Solution

  • df$group <- paste0('g', cumsum(c(1, df$start[-1] - df$end[-nrow(df)] > inter.interval.distance.cutoff)))
    
        id  start    end group
    1   i1   6575   6928    g1
    2   i2   7156   7392    g1
    3   i3   7949   8260    g1
    4   i4  45835  46104    g2
    5   i5  46347  46610    g2
    6   i6  47168  47485    g2
    7   i7 126804 127079    g3
    8   i8 127276 127542    g3
    9   i9 128127 128417    g3
    10 i10 157597 157872    g4
    11 i11 158074 158340    g4
    12 i12 158902 159219    g4
    13 i13 199129 199374    g5
    14 i14 199704 199951    g5
    15 i15 200507 200938    g5
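
The vectorized idea can be wrapped in a small helper. The sketch below assumes sorted, non-empty-or-empty numeric vectors and uses the gap as defined in the question (start of the current interval minus end of the preceding one); `group_by_gap` is a hypothetical name. For the data above, start-to-start distances happen to give the same groups, but the two measures can disagree when an interval is long relative to the cutoff, as the second call illustrates.

```r
# Vectorized grouping by end-to-start gap (assumes input sorted by start and end).
# `group_by_gap` is a hypothetical helper name used for illustration only.
group_by_gap <- function(start, end, cutoff) {
  n <- length(start)
  if (n == 0) return(character(0))
  gap <- start[-1] - end[-n]            # start[i] - end[i - 1] for i = 2..n
  # TRUE marks the first interval of each group; cumsum turns marks into indices
  paste0("g", cumsum(c(TRUE, gap > cutoff)))
}

start <- c(6575, 7156, 7949, 45835, 46347, 47168, 126804, 127276, 128127,
           157597, 158074, 158902, 199129, 199704, 200507)
end   <- c(6928, 7392, 8260, 46104, 46610, 47485, 127079, 127542, 128417,
           157872, 158340, 159219, 199374, 199951, 200938)
res_vec <- group_by_gap(start, end, 3243)

# Made-up example: one long interval followed by a nearby one.
# The gap is 5000 - 4000 = 1000 (same group), although the starts are 5000 apart.
res_long <- group_by_gap(c(0, 5000), c(4000, 5500), 3243)
res_long
# [1] "g1" "g1"
```

Because the work is done by vector subtraction and `cumsum`, this avoids an explicit loop over rows.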