Suppose I have a set of ranges by row:
| lower | upper |
|---|---|
| -10.4443200 | -8.695751 |
| -10.5356594 | -7.372029 |
| -3.9635740 | -2.661712 |
| -2.7043889 | -1.051237 |
| 0.8921994 | 2.525341 |
| 0.8495998 | 2.982567 |
| 0.9639315 | 3.149708 |
| 1.2656724 | 3.362623 |
| 2.8932368 | 5.332422 |
| 4.6476099 | 5.489882 |
What is an efficient way to count the number of pairs of ranges that overlap with one another?
One naive way is the following, but it is slow for millions of comparisons because of the double loop. Perhaps a vectorised approach using `foverlaps` would be ideal.
library(data.table)
a <- setDT(data)
setkey(a, lower, upper)

for (i in 1:nrow(a)) {
  for (j in 1:nrow(a)) {
    # pairwise overlap check (the result is discarded here; a real
    # implementation would still need to accumulate the counts)
    foverlaps(a[i, ], a[j, ])
  }
}
data <- structure(list(lower = c(-10.4443200112593, -10.5356593568179,
-3.96357398513697, -2.70438891891616, 0.892199380698278, 0.849599807772024,
0.963931532617852, 1.2656723800301, 2.89323680524585, 4.64760986325676
), upper = c(-8.69575093847071, -7.37202901360451, -2.66171192367237,
-1.05123670198647, 2.5253413373515, 2.98256679223578, 3.14970844448057,
3.3626226637927, 5.33242229071662, 5.48988156249026)), row.names = c(NA,
-10L), class = "data.frame")
An efficient algorithm is to sort the combined vector of lower and upper bounds, then, for each interval, add the number of intervals already open when it starts to the number of intervals that start while it is open. As a function using `data.table`:
library(data.table)

noverlaps2 <- function(dt) {
  n <- nrow(dt)
  dt[
    , N := data.table(
      val = unlist(dt),                # all 2n endpoints: the n lower bounds, then the n upper bounds
      i = rep(c(1L, -1L), each = n),   # +1 at a lower bound (interval opens), -1 at an upper bound (interval closes)
      j = rep(1:0, each = n)           # 1 at a lower bound, 0 at an upper bound (counts starts only)
    )[
      # sweep the endpoints in sorted order:
      # N1 = intervals currently open at each endpoint, N2 = intervals started so far
      order(val), `:=`(N1 = cumsum(i), N2 = cumsum(j))
    ][
      # rows 1:n are the lower bounds and rows (n + 1):2n the upper bounds, so for interval k:
      # (open at its start) - (started by its start) + (started by its end) - 1 (excluding itself)
      , N1[1:n] - N2[1:n] + N2[(n + 1):.N] - 1L
    ]
  ]
}
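The same count can also be read as an identity: for each interval it is (number of lower bounds ≤ its upper) minus (number of upper bounds < its lower) minus 1 for the interval itself. For intuition only, here is a minimal base R sketch of that identity (the name `count_overlaps` is just for illustration and is not part of the benchmarks below):
count_overlaps <- function(lower, upper) {
  # N_k = #(lower_j <= upper_k) - #(upper_j < lower_k) - 1
  findInterval(upper, sort(lower)) -
    findInterval(lower, sort(upper), left.open = TRUE) - 1L
}
count_overlaps(data$lower, data$upper)
#> [1] 1 1 1 1 3 4 4 4 4 1
These are the same `N` values as below, just reported in the original row order.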
Test on the example data:
noverlaps2(setDT(data))[]
#> lower upper N
#> 1: -10.5356594 -7.372029 1
#> 2: -10.4443200 -8.695751 1
#> 3: -3.9635740 -2.661712 1
#> 4: -2.7043889 -1.051237 1
#> 5: 0.8495998 2.982567 4
#> 6: 0.8921994 2.525341 3
#> 7: 0.9639315 3.149708 4
#> 8: 1.2656724 3.362623 4
#> 9: 2.8932368 5.332422 4
#> 10: 4.6476099 5.489882 1
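Since each overlapping pair contributes 1 to the `N` of both of its members, the total number of overlapping pairs asked for in the question is simply `sum(N) / 2`:
data[, sum(N) / 2]
#> [1] 12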
Compare timings against @Wimpel's `foverlaps` solution (as a function):
noverlaps1 <- function(dt) {
  setkey(dt, lower, upper)
  # self overlap-join; each range matches itself, hence the - 1
  foverlaps(dt, dt)[, .(N = .N - 1), .(lower, upper)]
}
Larger example dataset, with 10K rows:
data <- data.table(lower = cumsum(rexp(1e4)))
data[,upper := lower + rexp(.N, 0.1)]
Timings:
microbenchmark::microbenchmark(
Wimpel = noverlaps1(dt),
jblood94 = noverlaps2(dt),
check = "equal",
setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> Wimpel 88.4604 140.10715 223.3559 168.1201 244.86430 828.5056 100
#> jblood94 14.1549 20.38995 42.8457 26.4350 51.31825 387.5831 100
Also note that the performance of `noverlaps2` is not affected when each segment tends to overlap with many others, while `foverlaps` becomes much slower:
data <- data.table(lower = cumsum(rexp(1e4)))
data[,upper := lower + rexp(.N, 0.01)]
microbenchmark::microbenchmark(
Wimpel = noverlaps1(dt),
jblood94 = noverlaps2(dt),
check = "equal",
setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> Wimpel 1000.1568 1200.95605 1411.75152 1353.3996 1555.8535 2252.2487 100
#> jblood94 12.4723 25.99595 50.43143 43.7304 69.1171 157.1523 100
A final note: `noverlaps1` will reorder the data if it is not already ordered by `lower` and `upper` (it calls `setkey`), whereas `noverlaps2` updates the `data.table` in place without reordering.
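If the original row order matters, one workaround is to run `noverlaps1()` on a copy, or to record the order first and restore it afterwards. A small sketch (the helper column name `orig` is just for illustration):
res <- noverlaps1(copy(dt))  # leaves dt untouched

# or, avoiding the copy:
dt[, orig := .I]             # remember the original row order
res <- noverlaps1(dt)        # this keys (and reorders) dt by lower, upper
setorder(dt, orig)           # restore the original order
dt[, orig := NULL]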