Search code examples
rstringsubstringmemory-efficientstringdist

Fast and efficient substring extraction and comparison in R


I have a problem concerning very fast and efficient comparison between the substrings of two strings in my dataset, which won't run fast enough despite pretty powerful machinery. I have a data.table with 2 columns and about 1.5 billion rows, which has this structure:

library(data.table)
library(stringr)
library(stringi)
library(stringdist)

dt <- data.frame(c("002134", "024345", "176234"), c("002003", "024234", "002004"))
colnames(dt) <- c("class1", "class2")
setDT(dt)

What I want is a function that (1) extracts the first 3 digits from each string by row for both vectors, (2) compares the substrings between the two vectors, and (3) create a new boolean variable that reports whether the two substrings are equal or not.

So the desired result is as follows:

dt$sameclass <- c(TRUE, TRUE, FALSE)
print(dt)
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

I have tried versions of stringr and stringi both within data.table functionality and without. For comparing the substrings I use stringdist, since to my understanding can be parallelized which would be very beneficial on my server. However, the bottleneck still seems to be the substring extraction.


#stringi + stringdist without data.table: 
dt$redclass1 <- stri_sub(dt$class1, to = 3)
dt$redclass2 <- stri_sub(dt$class2, to = 3)
dt[, classdist := stringdist(a = redclass1, b = redclass2, method = "hamming")] 
dt[, sameclass := (classdist == 0)] 

#stringi + stringdist within data.table: 
dt[, classdist := stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")] 
dt[, sameclass := (classdist == 0)] 

#stringr with separate function: 
sameclass <- function(subclass1, subclass2, classdepth){
  truthvalue <- (str_sub(subclass1, end = classdepth) == str_sub(subclass2, end = classdepth))
  return(truthvalue)
} 
dt[, sameclass := sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))]

All versions either run into memory problems or take several hours to a day to run. Since I need to do this repeatedly this does not work for me, and I wanted to ask if you can come up with something faster/ more efficient. Any help would be greatly appreciated!

EDIT

I have benchmarked some of the methods suggested here, which indeed show a substantial speedup:

dt <- data.frame(rep(c("002134", "024345", "176234"), 1000), rep(c("002003", "024234", "002004"), 1000))
colnames(dt) <- c("class1", "class2")
setDT(dt)

times <- microbenchmark(

startswithtest = dt[, startsWith(class2, substring(class1, 1, 3))],
lapplytest = dt[, do.call(`==`, lapply(.SD, substring, 1, 3)), .SDcols = c("class1", "class2")],
numerictest = dt[, as.numeric(class1)%/%1000 == as.numeric(class2)%/%1000],
functiontest = dt[, sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))],
stringitest = dt[, stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")], 
times = 50
)
times
           expr       min        lq       mean     median         uq        max neval
 startswithtest   312.501   356.901   593.4530   444.8515    737.301   1692.602    50
     lapplytest   383.602   439.201   736.3512   522.7010    966.901   2259.601    50
    numerictest  1763.100  1932.600  3229.6651  2399.7510   4153.201   8396.301    50
   functiontest 45677.700 61124.002 81567.9409 77844.5510 100084.901 133921.502    50
    stringitest   794.201  1028.200  1423.5289  1259.6005   1739.400   3640.701    50

I will go with the startsWith for now, since it seems to offer the highest speed (unfortunately I was not able to use the C-function due to restrictions on my server). Thank you for the help!


Solution

  • Probably you can try

    > dt[, sameclass := do.call(`==`, lapply(.SD, substring, 1, 3))][]
       class1 class2 sameclass
    1: 002134 002003      TRUE
    2: 024345 024234      TRUE
    3: 176234 002004     FALSE
    

    or maybe faster

    > dt[, sameclass := startsWith(class2, substring(class1, 1, 3))][]
       class1 class2 sameclass
    1: 002134 002003      TRUE
    2: 024345 024234      TRUE
    3: 176234 002004     FALSE