I'm looking for options for improving performance of lookups into large (a million plus rows) tables of reference data. The lookups match a key based on ranges in each row of the table. Multiple matches are possible and I can either take a list of all matches or the tightest match (the one with the smallest range).
A little toy example is shown in the code below.
# Create small example dataset with overlapping ranges
#
ip_from <- c(1,2,4,6,7, 7, 8,11)
ip_to <- c(5,3,5,6,9,15,10,15)
vl <- c("A","B","C","D","E","F","G","H")
dt1 <- data.table(ip_from,ip_to,vl,stringsAsFactors=FALSE)
str(dt1)
head(dt1,n=10)
#
# Perform some lookups manually
# A hit is when ip_from<= x <=ip_to
# There can be multiple hits
#
x <- 8
dt2 <- dt1[x>=ip_from & x<=ip_to] #Can produce multiple hits
head(dt2)
y <- dt2[,min(ip_to-ip_from)] #Get smallest difference
dt3 <- dt2[(ip_to-ip_from)==y] #Get hits matching smallest
head(dt3)
dt4 <- dt3[1] #Take first hit if multiples
head(dt4)
#
# Create a function to do the match
# It crudely caters for misses.
#
lkp <- function(dtbl,ky){
dtx <- dtbl[ky>=ip_from & ky<=ip_to]
if (nrow(dtx)!=0) {
y <- dtx[,min(ip_to-ip_from)]
dty <- dtx[(ip_to-ip_from)==y]
dtz <- dty[1]
return(dtz)
}
else {
return(NA)
}
}
#
# Make a set of keys
#
z <- as.data.frame(c(8,6))
colnames(z) <- "Obs"
#
# Do lookups, this gets slow.
#
zz <- rbindlist(apply(z,1,function(x) lkp(dt1,x[1])))
zzz <- cbind(z,zz)
head(zzz)
Performance drops off as the number of keys to lookup increases and I could have cases where there are 10's of thousands of keys. Note that in the more realistic case the high and low values are keys in the data table.
Are there any other ways of doing this that could significantly improve performance?
I would do this:
setDT(z)
dt1[, range := ip_to - ip_from] #calculate all ranges
setorder(dt1, range) #order by ranges
#do a data.table non-equi join taking only the first match
dt1[z, .(Obs = Obs, vl = x.vl, ip_from = x.ip_from, ip_to = x.ip_to),
on = c("ip_from <= Obs", "ip_to >= Obs"), mult = "first"]
# Obs vl ip_from ip_to
#1: 8 E 7 9
#2: 6 D 6 6
For comparison:
print(zzz)
# Obs ip_from ip_to vl
#1 8 7 9 E
#2 6 6 6 D