Tags: r, data.table, range, overlap

How to efficiently calculate the number of overlaps between a set of ranges?


Suppose I have a set of ranges, one per row:

lower upper
-10.4443200 -8.695751
-10.5356594 -7.372029
-3.9635740 -2.661712
-2.7043889 -1.051237
0.8921994 2.525341
0.8495998 2.982567
0.9639315 3.149708
1.2656724 3.362623
2.8932368 5.332422
4.6476099 5.489882

What is an efficient way to count the number of pairs of ranges that overlap with one another? For example, the range [0.8496, 2.9826] overlaps four of the others: [0.8922, 2.5253], [0.9639, 3.1497], [1.2657, 3.3626], and [2.8932, 5.3324].

One naive way is the double loop below, but it is far too slow for millions of ranges because it performs O(n^2) pairwise comparisons. Perhaps a vectorised approach using foverlaps would be ideal.

library(data.table)
a <- as.data.table(data)  # `data` as given below
setkey(a, lower, upper)

# for each range, count how many of the other ranges overlap it
N <- integer(nrow(a))
for (i in seq_len(nrow(a))) {
    for (j in seq_len(nrow(a))) {
        # two ranges overlap iff each one starts before the other ends
        if (i != j && a$lower[i] <= a$upper[j] && a$lower[j] <= a$upper[i]) {
            N[i] <- N[i] + 1L
        }
    }
}

data <- structure(list(lower = c(-10.4443200112593, -10.5356593568179,
-3.96357398513697, -2.70438891891616, 0.892199380698278, 0.849599807772024,
0.963931532617852, 1.2656723800301, 2.89323680524585, 4.64760986325676
), upper = c(-8.69575093847071, -7.37202901360451, -2.66171192367237,
-1.05123670198647, 2.5253413373515, 2.98256679223578, 3.14970844448057,
3.3626226637927, 5.33242229071662, 5.48988156249026)), row.names = c(NA,
-10L), class = "data.frame")

Solution

  • An efficient algorithm is to sort the combined vector of lower and upper bounds and sweep through it once: for each interval, add the number of intervals already open when it starts to the number of intervals that start while it is open. As a function using data.table (the same sweep is also sketched in base R after the function):

    library(data.table)
    
    noverlaps2 <- function(dt) {
      n <- nrow(dt)
      dt[
        , N := data.table(
          val = unlist(dt),               # all 2n bounds: the n lowers, then the n uppers
          i = rep(c(1L, -1L), each = n),  # +1 at a lower bound, -1 at an upper bound
          j = rep(1:0, each = n)          # 1 at a lower bound, 0 at an upper bound
        )[
          # in sorted order of the bounds, N1 = intervals currently open,
          # N2 = intervals started so far
          order(val), `:=`(N1 = cumsum(i), N2 = cumsum(j))
        ][
          # per interval: (open at its start, minus itself) + (starts while it is open)
          , N1[1:n] - N2[1:n] + N2[(n + 1):.N] - 1L
        ]
      ]
    }
    

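    The same sweep written out in base R may be easier to follow (a minimal sketch; noverlaps_base is an illustrative name, not part of the original answer):

    noverlaps_base <- function(lower, upper) {
      n <- length(lower)
      val <- c(lower, upper)                # all 2n bounds
      step <- rep(c(1L, -1L), each = n)     # +1 opens an interval, -1 closes one
      strt <- rep(c(1L, 0L), each = n)      # marks start events only
      o <- order(val)
      N1 <- N2 <- integer(2 * n)
      N1[o] <- cumsum(step[o])              # intervals open after each event
      N2[o] <- cumsum(strt[o])              # intervals started after each event
      # (open at the start, minus the interval itself) + (starts while it is open)
      N1[1:n] - N2[1:n] + N2[(n + 1):(2 * n)] - 1L
    }
    
    noverlaps_base(data$lower, data$upper)  # same counts as the N column, in row order
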
    Test on the example data:

    noverlaps2(setDT(data))[]
    #>           lower     upper N
    #>  1: -10.5356594 -7.372029 1
    #>  2: -10.4443200 -8.695751 1
    #>  3:  -3.9635740 -2.661712 1
    #>  4:  -2.7043889 -1.051237 1
    #>  5:   0.8495998  2.982567 4
    #>  6:   0.8921994  2.525341 3
    #>  7:   0.9639315  3.149708 4
    #>  8:   1.2656724  3.362623 4
    #>  9:   2.8932368  5.332422 4
    #> 10:   4.6476099  5.489882 1
    

    Compare timings against @Wimpel's foverlaps solution (as a function):

    noverlaps1 <- function(dt) {
      setkey(dt, lower, upper)  # foverlaps() requires keyed data.tables
      # self-join: one row per overlapping pair, and every range matches itself,
      # hence the .N - 1
      foverlaps(dt, dt)[, .(N = .N - 1), .(lower, upper)]
    }
    
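    For intuition about what the join produces (a sketch; pairs is an illustrative name):

    dt <- as.data.table(data)
    setkey(dt, lower, upper)
    pairs <- foverlaps(dt, dt)  # one row per overlapping (range, range) pair
    nrow(pairs) - nrow(dt)      # total overlap count, i.e. the sum of the N column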

    Larger example dataset, with 10K rows:

    data <- data.table(lower = cumsum(rexp(1e4)))
    data[,upper := lower + rexp(.N, 0.1)]
    

    Timings:

    microbenchmark::microbenchmark(
      Wimpel = noverlaps1(dt),
      jblood94 = noverlaps2(dt),
      check = "equal",
      setup = {dt <- copy(data)}
    )
    #> Unit: milliseconds
    #>      expr     min        lq     mean   median        uq      max neval
    #>    Wimpel 88.4604 140.10715 223.3559 168.1201 244.86430 828.5056   100
    #>  jblood94 14.1549  20.38995  42.8457  26.4350  51.31825 387.5831   100
    
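    (check = "equal" passes here because lower = cumsum(rexp(1e4)) is already increasing, so the setkey inside noverlaps1 does not reorder the rows and both functions return the same three columns in the same order.)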

    Also note that noverlaps2's performance is largely unaffected when each segment tends to overlap many more segments, while foverlaps becomes much slower: the sweep only ever sorts and scans the 2n bounds, whereas foverlaps materializes one row per overlapping pair. Below, lowering the rexp rate from 0.1 to 0.01 raises the mean segment length from 10 to 100:

    data <- data.table(lower = cumsum(rexp(1e4)))
    data[,upper := lower + rexp(.N, 0.01)]
    
    microbenchmark::microbenchmark(
      Wimpel = noverlaps1(dt),
      jblood94 = noverlaps2(dt),
      check = "equal",
      setup = {dt <- copy(data)}
    )
    #> Unit: milliseconds
    #>      expr       min         lq       mean    median        uq       max neval
    #>    Wimpel 1000.1568 1200.95605 1411.75152 1353.3996 1555.8535 2252.2487   100
    #>  jblood94   12.4723   25.99595   50.43143   43.7304   69.1171  157.1523   100
    

    A final note on side effects: noverlaps1 calls setkey, which sorts the data.table by reference if it is not already ordered by (lower, upper), whereas noverlaps2 adds the N column in place without reordering the rows.
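
    A quick way to see the difference (a sketch):

    dt1 <- copy(data); noverlaps1(dt1)  # side effect: dt1 is now sorted by (lower, upper)
    dt2 <- copy(data); noverlaps2(dt2)  # dt2 keeps its original row order, plus an N column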