
Computing pairwise Hamming distance between all rows of two integer matrices/data frames


I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of Hamming distance.

I used the e1071 package to compute the Hamming distance. For two vectors x and y it can be computed, for example, as:

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)

y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)

library(e1071)  ## provides hamming.distance()

## one column of 32 bits per integer
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)

distance <- sum(sapply(seq_len(ncol(xm)), function(i) hamming.distance(xm[,i], ym[,i])))

The resulting distance is 25. However, I need to do this for every pair of rows from df1 and df2. The naive approach is a doubly nested loop, which is terribly slow.

Any ideas how to do this more efficiently? In the end I need to append to df2:

  • a column with the row id from df1 that gives the lowest distance;
  • a column with the lowest distance;
  • a column with the row id from df1 that gives the 2nd lowest distance;
  • a column with the second lowest distance.

Thanks.


Solution

    Fast computation of Hamming distance between two integer vectors of equal length

    As I said in my comment, we can do:

    hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
    

    to compute the Hamming distance between two integer vectors x and y of equal length. This uses only base R, yet is more efficient than e1071::hamming.distance because it is vectorized: all bits of both vectors are compared in a single xor() call instead of one call per element.

    For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
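
    As a quick sanity check of the idea, the sketch below redefines hmd0 and applies it to small inputs whose bit patterns are easy to work out by hand. The input values are made up for illustration.

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

## 5 = 101b and 3 = 011b differ in two bit positions
hmd0(5L, 3L)

## the per-element bit differences are summed over the whole vector:
## 1 vs 3 differ in one bit, 4 vs 4 in none, 0 vs 7 in three
hmd0(c(1L, 4L, 0L), c(3L, 4L, 7L))
```

    The first call should give 2 and the second 4, matching the hand count in the comments.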


    Fast hamming distance between a matrix and a vector

    If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.

    hmd <- function(x, y) {
      rawx <- intToBits(x)
      rawy <- intToBits(y)
      nx <- length(rawx)
      ny <- length(rawy)
      if (nx == ny) {
        ## equal lengths: a single distance, quick return
        return(sum(as.logical(xor(rawx, rawy))))
      } else if (nx < ny) {
        ## swap so that rawx is always the longer vector
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
      }
      if (nx %% ny) stop("unconformable length!")
      nc <- nx / ny  ## number of cycles, i.e. columns of the matrix
      ## xor() recycles rawy; sum the differing bits within each cycle
      unname(tapply(as.logical(xor(rawx, rawy)), rep(seq_len(nc), each = ny), sum))
    }
    

    Note that:

    1. hmd works column-wise, which is CPU cache friendly because R stores matrices in column-major order. So if we want a row-wise computation, we should transpose the matrix first;
    2. there is no explicit loop here; instead, we use tapply().
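
    The column-wise layout can also be exploited without tapply(). The sketch below is an alternative formulation, not the hmd function above: it XORs the integers with bitwXor() (which recycles the shorter argument), expands the result with intToBits(), and sums each block of 32 * nrow(X) bits with colSums(). The name hmd_cols and the toy inputs are assumptions for illustration.

```r
hmd_cols <- function(X, y) {
  X <- as.matrix(X)
  stopifnot(nrow(X) == length(y))
  ## XOR every column of X with y; y is recycled down the columns
  bits <- intToBits(bitwXor(as.integer(X), as.integer(y)))
  ## each column of X contributes 32 * nrow(X) bits; count the set ones
  colSums(matrix(as.integer(bits), nrow = 32L * nrow(X)))
}

X <- matrix(1:6, nrow = 2)  ## columns are (1,2), (3,4), (5,6)
y <- c(1L, 2L)
hmd_cols(X, y)
```

    Here the first column equals y, so its distance is 0; the second and third columns differ from y in 3 and 2 bits respectively.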

    Fast hamming distance computation between two matrices/data frames

    This is what you want. The following function foo takes two data frames or matrices df1 and df2 and computes the distance between every row of df1 and each row of df2. The argument p is an integer giving the number of results to retain: p = 3 keeps the 3 smallest distances together with their row ids in df1.

    foo <- function(df1, df2, p) {
      ## check p: we cannot keep more matches than there are rows in df1
      if (p > nrow(df1)) p <- nrow(df1)
      ## transpose for CPU cache friendly code
      xt <- t(as.matrix(df1))
      yt <- t(as.matrix(df2))
      ## after transpose, we compute hamming distance column by column
      ## a for loop is decent; no performance gain from apply family
      n <- ncol(yt)
      id <- integer(n * p)
      d <- numeric(n * p)
      k <- 1:p
      for (i in 1:n) {
        distance <- hmd(xt, yt[,i])
        minp <- order(distance)[1:p]
        id[k] <- minp
        d[k] <- distance[minp]
        k <- k + p
        }
      ## recode "id" and "d" into data frame and return
      id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
      colnames(id) <- paste0("min.", 1:p)
      d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
      colnames(d) <- paste0("mindist.", 1:p)
      list(id = id, d = d)
      }
    

    Note that:

    1. transposition is done at the beginning, for the reason given above (hmd works column-wise);
    2. a for loop is used here. This is still efficient because each iteration does a considerable amount of work. It is also more elegant than using the *apply family, since we need multiple outputs per iteration (row ids id and distances d).
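
    The bookkeeping inside the loop (pick the p smallest distances, store them in a flat vector, reshape with byrow = TRUE at the end) can be seen in isolation in the sketch below. The distance values are invented for illustration; each element of dists plays the role of one hmd() result.

```r
p <- 2
## hypothetical distances from 2 rows of df2 to 4 rows of df1
dists <- list(c(5, 1, 3, 1), c(0, 7, 2, 4))

n <- length(dists)
id <- integer(n * p)
d <- numeric(n * p)
k <- seq_len(p)  ## positions in the flat vectors for the current row
for (i in seq_len(n)) {
  minp <- order(dists[[i]])[seq_len(p)]  ## ties keep their original order
  id[k] <- minp
  d[k] <- dists[[i]][minp]
  k <- k + p
}

## one row per row of df2, one column per retained match
matrix(id, ncol = p, byrow = TRUE)
matrix(d, ncol = p, byrow = TRUE)
```

    For the first vector the two smallest distances are both 1, at positions 2 and 4; for the second they are 0 and 2, at positions 1 and 3.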

    Experiment

    This part uses a small dataset to test and demonstrate our functions.

    Some toy data:

    set.seed(0)
    df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
    df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols
    

    Test hmd first (needs transposition):

    hmd(t(as.matrix(df1)), df2[1, ])  ## df1 & first row of df2
    # [1] 2 4 6 2 4
    

    Test foo:

    foo(df1, df2, p = 2)
    
    # $id
    #   min.1 min.2
    # 1     1     4
    # 2     2     3
    # 3     5     2
    
    # $d
    #   mindist.1 mindist.2
    # 1         2         2
    # 2         1         3
    # 3         1         3
    

    If you want to append some columns to df2, you know what to do, right?
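
    For completeness, appending the result to df2 could look like the sketch below. To keep it runnable on its own, res is a hand-built stand-in with the same shape as the list returned by foo (values copied from the output above), and the contents of df2 are hypothetical.

```r
## hypothetical new data with 3 rows
df2 <- data.frame(V1 = c(10L, 20L, 30L), V2 = c(11L, 21L, 31L))

## stand-in for: res <- foo(df1, df2, p = 2)
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

## bind, then order columns as: best id, best distance, 2nd id, 2nd distance
out <- cbind(df2, res$id, res$d)
out <- out[, c(names(df2), "min.1", "mindist.1", "min.2", "mindist.2")]
out
```

    This gives the four extra columns asked for in the question, in the order listed there.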