Search code examples
rdata.tablebinary-search

Subsetting a data.table with a binary search by range in two columns


I'm trying to quickly access a subset of a large data.table. The data has three columns, all numeric (floating point) all with very little repetition. Two columns are data I'd like to perform a binary search on, and the third column contains the numbers I'm actually interested in. Essentially, I've got (x, y, z) data where I'd like to specify a range in x and a range in y and return all the rows within those ranges.

# Generate some toy data of about the same size as the real data
DT <- data.table(x=runif(2000000), y=runif(2000000), z=runif(2000000))
head(DT)
#            x         y         z
# 1: 0.2675023 0.5725162 0.4162230
# 2: 0.1444540 0.8114941 0.1557195
# 3: 0.3607260 0.8159502 0.9705079
# 4: 0.3370213 0.9217284 0.5269885
# 5: 0.1085204 0.6312943 0.9676716
# 6: 0.1076674 0.1623447 0.1753712
ranges <- data.frame(x_min=runif(10000, max = 0.5), x_max=runif(10000, min = 0.5),
                     y_min=runif(10000, max = 0.5), y_max=runif(10000, min = 0.5))
head(ranges)
#        x_min     x_max      y_min     y_max
# 1 0.43817551 0.6720366 0.28052942 0.6309755
# 2 0.07469295 0.6744950 0.23170272 0.8431767
# 3 0.29520846 0.6991277 0.01882153 0.5162244
# 4 0.10500034 0.8977652 0.04806678 0.9528880
# 5 0.20168728 0.5655350 0.34401695 0.8241058
# 6 0.44158099 0.6739211 0.05359761 0.5832320

Here's a visual example of what I'm trying to do; I want all the points within the red rectangle, where the edges of the rectangle are determined by the max and min of the x and y ranges. However, I have lots of red rectangles that I'll be looping over.

plot(DT$x, DT$y)
rect(xleft = ranges$x_min[1], xright = ranges$x_max[1],
     ybottom = ranges$y_min[1], ytop = ranges$y_max[1], border = "red")

Example

Currently, the code I'm working with uses a vector scan rather than a binary search (I think), but does exactly what I'd like it to.

lapply(seq_len(nrow(ranges)), function(i){
  DT[x%between%c(ranges[i,]$x_min, ranges[i,]$x_max)&
       y%between%c(ranges[i,]$y_min, ranges[i,]$y_max)]
})

However, this is still the slowest step in the process according to profvis and given that I'm new to the data.table world I'd like to make sure there's not something obvious I'm missing. As far as I can tell, it could be possible to speed this up using the data.table keys to run a binary search rather than a vector scan. However, I haven't been able to figure out how to search a range rather than a single value.

This question asks something very similar but the best answer (from Matt) indicates that this wasn't doable easily in 2014 when the question was posted. He notes that this kind of problem really requires range join implementation and references a feature request on the GitHub page which has since been resolved (a couple months after opening).

Three years later, the question was updated with the new %between% functionality that I've already implemented, but I still don't think this uses a binary search on the data. The feature request implied that the ideal solution would be of the form DT[J(id,DT(from,to)),...], which is clearly using the J() syntax to leverage the keys.

Does the %between% syntax actually use binary search under the hood? If not, how can I provide two ranges and still use the speedy binary search functionality?

P.S. dplyr's filter() is about 3x slower on the dataset, so that's out.


Solution

  • My understanding is that rolling join uses binary search but only on the last joining key, hence it is not simultaneously possible to perform rolling join on 4 keys. In addition, your values are non-integer in nature and hence it is not possible to pinpoint the 4 corners using binary search.

    Having said that, here are some options to speed up the subsetting with non-equi join being the fastest but I face some memory limitation issues with your dimensions:

    m0 <- function()
        lapply(seq_len(nrow(ranges)), function(i){
            DT[x%between%c(ranges[i,]$x_min, ranges[i,]$x_max)&
                    y%between%c(ranges[i,]$y_min, ranges[i,]$y_max)]
        })
    
    m1 <- function()
        ranges[, DT[x %between% c(x_min, x_max) & y %between% c(y_min, y_max)], 1L:nrow(ranges)]
    
    m2 <- function() {
        setkey(DT, x, y)
        setDT(ranges, key=c("x_min", "x_max", "y_min", "y_max"))
        DT[ranges, on=.(x>=x_min, x<=x_max, y>=y_min, y<=y_max), allow.cartesian=TRUE, .(x.x, x.y, x.z)]
    }
    
    m3 <- function() {
        setkey(DT3, x)[, rn := .I]
        ranges[, ixmin := DT3[.SD, on=.(x=x_min), roll=-Inf, rn]]
        ranges[, ixmax := DT3[.SD, on=.(x=x_max), roll=Inf, rn]]
    
        setkey(DT3, y)
        DT3[DT3[ranges, on=.(y>=y_min, y<=y_max),
            by=.EACHI, .(rn=rn[rn %between% c(ixmin, ixmax)])], on=.(rn),
            .(x, y, z)]
    }
    
    microbenchmark::microbenchmark(times=1L, m0(), m1(), m2(), m3())
    

    timings:

    Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
     m0() 782.6070 782.6070 782.6070 782.6070 782.6070 782.6070     1
     m1() 713.9469 713.9469 713.9469 713.9469 713.9469 713.9469     1
     m2() 272.6018 272.6018 272.6018 272.6018 272.6018 272.6018     1
     m3() 765.3667 765.3667 765.3667 765.3667 765.3667 765.3667     1
    

    data:

    library(data.table)
    set.seed(0L)
    nr <- 2e4L
    nrng <- 1e3L
    dat <- data.table(x=runif(nr), y=runif(nr), z=runif(nr))
    ranges <- data.frame(x_min=runif(nrng, max = 0.5), x_max=runif(nrng, min = 0.5),
        y_min=runif(nrng, max = 0.5), y_max=runif(nrng, min = 0.5))
    dat[, rn := .I]
    
    DT3 <- copy(dat)
    DT <- copy(dat)