I need to calculate the hamming distance between an input string and a large string dataset. (All strings in the dataset have the same length of the input string.)
For example, if
input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
the hamming distance between input
and each string in dataset
is 1, 1, 3, 0 so the minimum is 0. I have written a function to calculate the hamming distance between two strings:
HD <- function(str1, str2){
str1 <- as.character(str1)
str2 <- as.character(str2)
length.str1 <- nchar(str1)
length.str2 <- nchar(str2)
string.temp1 <- c()
for (i in 1:length.str1){
string.temp1[i] = substr(str1, start=i, stop=i)
}
string.temp2 <- c()
for (i in 1:length.str2){
string.temp2[i] = substr(str2, start=i, stop=i)
}
return(sum(string.temp1 != string.temp2))
}
But the dataset is too big so I need to speed it up, do you have any idea that I can do it quickly? Thank you for your help.
At R level you can use strsplit
, cbind
, !=
, colSums
and min
. They are all "vectorized".
a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
A <- strsplit(a, split = "")[[1]]
#[1] "Y" "N" "Y" "Y" "E" "Y"
B <- do.call("cbind", strsplit(b, split = ""))
# [,1] [,2] [,3] [,4]
#[1,] "Y" "Y" "Y" "Y"
#[2,] "N" "N" "N" "N"
#[3,] "Y" "Y" "E" "Y"
#[4,] "Y" "Y" "N" "Y"
#[5,] "E" "Y" "E" "E"
#[6,] "E" "Y" "N" "Y"
D <- colSums(A != B)
#[1] 1 1 3 0
min(D)
#[1] 0
This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM. But hopefully it is worthwhile.
At C / C++ level you can do much much better (see a case study at here), but I am not keen on writing C / C++ code today.
I come across the stringdist
package (there is even a stringdist tag). The function stringdist
relies on a workhorse routine stringdist:::do_dist
, which is written in C. It spares my effort.
library(stringdist)
d <- stringdist(a, b, method = "hamming")
#[1] 1 1 3 0
min(d)
#[1] 0
stringdist()
runs almost ten times slower thancolSum()
.
That is really interesting. Probably its C code or R code is doing something else complicated.