I'm trying to quickly access a subset of a large data.table. The data has three columns, all numeric (floating point) and all with very little repetition. Two columns are data I'd like to perform a binary search on, and the third column contains the numbers I'm actually interested in. Essentially, I've got (x, y, z) data where I'd like to specify a range in x and a range in y and return all the rows within those ranges.
# Generate some toy data of about the same size as the real data
library(data.table)
DT <- data.table(x=runif(2000000), y=runif(2000000), z=runif(2000000))
head(DT)
# x y z
# 1: 0.2675023 0.5725162 0.4162230
# 2: 0.1444540 0.8114941 0.1557195
# 3: 0.3607260 0.8159502 0.9705079
# 4: 0.3370213 0.9217284 0.5269885
# 5: 0.1085204 0.6312943 0.9676716
# 6: 0.1076674 0.1623447 0.1753712
ranges <- data.frame(x_min=runif(10000, max = 0.5), x_max=runif(10000, min = 0.5),
y_min=runif(10000, max = 0.5), y_max=runif(10000, min = 0.5))
head(ranges)
# x_min x_max y_min y_max
# 1 0.43817551 0.6720366 0.28052942 0.6309755
# 2 0.07469295 0.6744950 0.23170272 0.8431767
# 3 0.29520846 0.6991277 0.01882153 0.5162244
# 4 0.10500034 0.8977652 0.04806678 0.9528880
# 5 0.20168728 0.5655350 0.34401695 0.8241058
# 6 0.44158099 0.6739211 0.05359761 0.5832320
Here's a visual example of what I'm trying to do; I want all the points within the red rectangle, where the edges of the rectangle are determined by the max and min of the x and y ranges. However, I have lots of red rectangles that I'll be looping over.
plot(DT$x, DT$y)
rect(xleft = ranges$x_min[1], xright = ranges$x_max[1],
ybottom = ranges$y_min[1], ytop = ranges$y_max[1], border = "red")
Currently, the code I'm working with uses a vector scan rather than a binary search (I think), but does exactly what I'd like it to.
lapply(seq_len(nrow(ranges)), function(i) {
  DT[x %between% c(ranges[i, ]$x_min, ranges[i, ]$x_max) &
     y %between% c(ranges[i, ]$y_min, ranges[i, ]$y_max)]
})
However, this is still the slowest step in the process according to profvis, and given that I'm new to the data.table world I'd like to make sure there's not something obvious I'm missing. As far as I can tell, it may be possible to speed this up by using data.table keys to run a binary search rather than a vector scan. However, I haven't been able to figure out how to search a range rather than a single value.
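For context, here's the kind of single-value keyed lookup I understand (a toy sketch with an integer id column just for illustration; my real columns are floating point), versus the range version I can't figure out how to key:

# toy sketch only: an integer id column so exact key matches exist
DTi <- data.table(id = 1:5, z = runif(5), key = "id")
DTi[J(3L)]                    # keyed lookup of a single value (binary search)
DTi[id %between% c(2L, 4L)]   # the range version; unclear to me whether this uses the key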
This question asks something very similar, but the best answer (from Matt) indicates that this wasn't easily doable in 2014 when the question was posted. He notes that this kind of problem really requires a range join implementation and references a feature request on the GitHub page, which has since been resolved (a couple of months after opening).
Three years later, the question was updated with the new %between% functionality that I've already implemented, but I still don't think this uses a binary search on the data. The feature request implied that the ideal solution would be of the form DT[J(id,DT(from,to)),...], which clearly uses the J() syntax to leverage the keys.
Does the %between% syntax actually use a binary search under the hood? If not, how can I provide two ranges and still use the speedy binary search functionality?
P.S. dplyr's filter() is about 3x slower on the dataset, so that's out.
My understanding is that a rolling join uses binary search, but only on the last join key, so it is not possible to perform a rolling join on four keys simultaneously. In addition, your values are non-integer, so it is not possible to pinpoint the four corners of each rectangle using binary search.
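To illustrate, here is a minimal rolling-join sketch on toy data (A and B are made up for this example), where the binary search applies to the single join key x:

A <- data.table(x = c(0.2, 0.5, 0.9), id = 1:3, key = "x")
B <- data.table(x = 0.6)
A[B, on = .(x), roll = Inf]   # rolls to the last A row with x <= 0.6, i.e. id = 2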
Having said that, here are some options to speed up the subsetting, with the non-equi join (m2) being the fastest, although I ran into memory limits at your original dimensions (hence the smaller dataset below):
# m0: original approach, looping over ranges with vector scans
m0 <- function()
  lapply(seq_len(nrow(ranges)), function(i) {
    DT[x %between% c(ranges[i, ]$x_min, ranges[i, ]$x_max) &
       y %between% c(ranges[i, ]$y_min, ranges[i, ]$y_max)]
  })
# m1: same vector scan, but done as a single grouped data.table query
m1 <- function()
  ranges[, DT[x %between% c(x_min, x_max) & y %between% c(y_min, y_max)], 1L:nrow(ranges)]
# m2: non-equi join; each row of ranges is joined to the DT rows inside its rectangle
m2 <- function() {
  setkey(DT, x, y)
  setDT(ranges, key=c("x_min", "x_max", "y_min", "y_max"))
  DT[ranges, on=.(x>=x_min, x<=x_max, y>=y_min, y<=y_max), allow.cartesian=TRUE, .(x.x, x.y, x.z)]
}
# m3: rolling joins to bound the row-index window in x, then a non-equi join on y
m3 <- function() {
  # index rows in x order so the x range can be expressed as a window of row numbers
  setkey(DT3, x)[, rn := .I]
  ranges[, ixmin := DT3[.SD, on=.(x=x_min), roll=-Inf, rn]]
  ranges[, ixmax := DT3[.SD, on=.(x=x_max), roll=Inf, rn]]
  # non-equi join on y, keeping only rows whose x-order index falls inside the window
  setkey(DT3, y)
  DT3[DT3[ranges, on=.(y>=y_min, y<=y_max),
          by=.EACHI, .(rn=rn[rn %between% c(ixmin, ixmax)])], on=.(rn),
      .(x, y, z)]
}
microbenchmark::microbenchmark(times=1L, m0(), m1(), m2(), m3())
timings:
Unit: milliseconds
expr min lq mean median uq max neval
m0() 782.6070 782.6070 782.6070 782.6070 782.6070 782.6070 1
m1() 713.9469 713.9469 713.9469 713.9469 713.9469 713.9469 1
m2() 272.6018 272.6018 272.6018 272.6018 272.6018 272.6018 1
m3() 765.3667 765.3667 765.3667 765.3667 765.3667 765.3667 1
data:
library(data.table)
set.seed(0L)
nr <- 2e4L
nrng <- 1e3L
dat <- data.table(x=runif(nr), y=runif(nr), z=runif(nr))
ranges <- data.frame(x_min=runif(nrng, max = 0.5), x_max=runif(nrng, min = 0.5),
                     y_min=runif(nrng, max = 0.5), y_max=runif(nrng, min = 0.5))
setDT(ranges)   # m1 and m3 use data.table syntax on ranges
dat[, rn := .I]
DT3 <- copy(dat)
DT <- copy(dat)
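As a rough consistency check (not part of the timings, and assuming the four approaches should select the same matches), the total number of matched rows can be compared; note that m2 and m3 set keys and add columns by reference:

# total matches should be identical across m0..m3
c(nrow(rbindlist(m0())), nrow(m1()), nrow(m2()), nrow(m3()))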