Search code examples
rstringperformance

Fast pairwise longest common substring from start


Suppose there are many binary strings

x <- c("0100100010101010", "0100110010101010","0111001000","010111")

I am looking for a fast method in R to output a matrix containing the pairwise (excluding self-matches) longest common substring from the start. For example, a solution could look like this

> mySolution(x)
    [,1]     [,2]      [,3]      [,4]
[1,] ""       "01001"   "01"     "010"
[2,] "01001"  ""        "01"     "010"
[3,] "01"     "01"      ""       "01"
[4,] "010"    "010"     "01"     ""

E.g., the matrix at position [1,2] is the longest common substring from the start of x[1] and x[2]. Note that the resulting matrix is symmetric. So, we only need to compute one half of it.

I know that could work with functions like substr,sub,grepl, but since I have many strings, I am looking for a very efficient solution that requires little computation time. Maybe it is an option to convert to binary numbers to improve performance?


Solution

  • A fast solution for large vectors of binary strings. The function flcs below is able to perform the desired operation on 10k binary character strings in about 13 seconds.

    The idea is to sequentially check the kth character of each string and put each string into a 0 bin or a 1 bin. The longest common substring between i and j terminates at or before k if i and j are in different bins.

    flcs returns the endpoint of the pairwise common substring, which can be used to fill the bottom triangle of the desired matrix.

    library(data.table)
    
    flcs <- function(x) {
      y <- lapply(x, \(x) utf8ToInt(x) == 49L)
      nx <- length(x)
      nx1 <- nx - 1L
      lens <- lengths(y)
      dt <- data.table(
        v = unlist(y),
        i = sequence(lens),
        j1 = rep.int(1:nx, lens)
      )[,j2 := (j1 - 1L)*nx]
      # `n` is a vector containing the earliest mismatch between pairs
      n <- pmin(lens[rep.int(1:nx1, nx1:1)], lens[sequence(nx1:1, 2:nx)])
      b <- logical(length(n))
      # create a vector `idx` to map to the indices of `n`
      idx <- integer(nx^2)
      idx[sequence(nx1:1, seq(2, nx^2, nx + 1))] <-
        idx[sequence(nx1:1, seq(nx + 1, nx^2, nx + 1), nx)] <- 1:length(n)
      
      for (k in 1:-sort.int(-lens, 2)[2]) {
        nstop <- idx[dt[i == k, outer(j2[v], j1[!v], "+")]]
        nstop <- nstop[!b[nstop]]
        b[nstop] <- TRUE
        n[nstop] <- k - 1L
        if (all(b)) break
      }
      
      n
    }
    

    Demonstrating:

    x <- c("0100100010101010", "0100110010101010","0111001000","010111")
    flcs(x)
    #> [1] 5 2 3 2 3 2
    

    A function to build the final matrix from the output of flcs:

    buildmat <- function(x, n, symmetric = FALSE) {
      nx <- length(x)
      nx1 <- nx - 1L
      z <- matrix("", nx, nx)
      if (symmetric) {
        z[sequence(nx1:1, seq(2, nx^2, nx + 1))] <-
          z[sequence(nx1:1, seq(nx + 1, nx^2, nx + 1), nx)] <-
          substr(x[rep.int(1:nx1, nx1:1)], 1, n)
      } else {
        z[sequence(nx1:1, seq(2, nx^2, nx + 1))] <-
          substr(x[rep.int(1:nx1, nx1:1)], 1, n)
      }
      
      z
    }
    
    buildmat(x, flcs(x), TRUE)
    #>      [,1]    [,2]    [,3] [,4] 
    #> [1,] ""      "01001" "01" "010"
    #> [2,] "01001" ""      "01" "010"
    #> [3,] "01"    "01"    ""   "01" 
    #> [4,] "010"   "010"   "01" ""
    

    Compare to a solution that compares each string to each other string in parallel.

    library(parallel)
    
    f <- function(x) {
      n <- min(length(x[[1]]), length(x[[2]]))
      out <- which.max(x[[1]][1:n] != x[[2]][1:n])
      if (out == 1L && x[[1]][1] == x[[2]][1]) n + 1L else out
    }
    
    cl <- makeCluster(detectCores() - 1) # 15 cores
    clusterExport(cl, c("f"), environment(f))
    
    flcsPar <- function(x) {
      unlist(parLapply(cl, combn(lapply(x, utf8ToInt), 2, NULL, FALSE), f)) - 1L
    }
    
    buildmat(x, flcsPar(x), TRUE)
    #>      [,1]    [,2]    [,3] [,4] 
    #> [1,] ""      "01001" "01" "010"
    #> [2,] "01001" ""      "01" "010"
    #> [3,] "01"    "01"    ""   "01" 
    #> [4,] "010"   "010"   "01" ""
    

    Time the two functions on a vector of 10k binary strings.

    x <- vapply(1:1e4, \(.) intToUtf8(sample(48:49, sample(10:20, 1), 1)), "")
    system.time(n1 <- flcs(x))
    #>    user  system elapsed 
    #>    9.84    3.41   13.26
    system.time(n2 <- flcsPar(x))
    #>    user  system elapsed 
    #>   69.31   54.53  242.07
    identical(n1, n2)
    #> [1] TRUE
    

    flcs performed nearly 50 million comparisons in about 13 seconds.