Search code examples
rstringperformancevectorizationhamming-distance

Find the minimum hamming distance between a string and a long vector of strings (fast)


I need to calculate the hamming distance between an input string and a large string dataset. (All strings in the dataset have the same length of the input string.)

For example, if

input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

the hamming distance between input and each string in dataset is 1, 1, 3, 0 so the minimum is 0. I have written a function to calculate the hamming distance between two strings:

HD <- function(str1, str2){

   str1 <- as.character(str1)
   str2 <- as.character(str2)

   length.str1 <- nchar(str1)
   length.str2 <- nchar(str2)

   string.temp1 <- c()
   for (i in 1:length.str1){
     string.temp1[i] = substr(str1, start=i, stop=i)
   }
   string.temp2 <- c()
   for (i in 1:length.str2){
     string.temp2[i] = substr(str2, start=i, stop=i)
   }
   return(sum(string.temp1 != string.temp2))
   }

But the dataset is too big so I need to speed it up, do you have any idea that I can do it quickly? Thank you for your help.


Solution

  • At R level you can use strsplit, cbind, !=, colSums and min. They are all "vectorized".

    a <- "YNYYEY"
    b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
    A <- strsplit(a, split = "")[[1]]
    #[1] "Y" "N" "Y" "Y" "E" "Y"
    B <- do.call("cbind", strsplit(b, split = ""))
    #     [,1] [,2] [,3] [,4]
    #[1,] "Y"  "Y"  "Y"  "Y" 
    #[2,] "N"  "N"  "N"  "N" 
    #[3,] "Y"  "Y"  "E"  "Y" 
    #[4,] "Y"  "Y"  "N"  "Y" 
    #[5,] "E"  "Y"  "E"  "E" 
    #[6,] "E"  "Y"  "N"  "Y" 
    D <- colSums(A != B)
    #[1] 1 1 3 0
    min(D)
    #[1] 0
    

    This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM. But hopefully it is worthwhile.

    At C / C++ level you can do much much better (see a case study at here), but I am not keen on writing C / C++ code today.


    I come across the stringdist package (there is even a tag). The function stringdist relies on a workhorse routine stringdist:::do_dist, which is written in C. It spares my effort.

    library(stringdist)
    d <- stringdist(a, b, method = "hamming")
    #[1] 1 1 3 0
    min(d)
    #[1] 0
    

    stringdist() runs almost ten times slower than colSum().

    That is really interesting. Probably its C code or R code is doing something else complicated.