
Computing number of bits that are set to 1 for matching rows in terms of hamming distance between two data frames


I have two data frames, df1 and df2, with the same number of columns (but not rows). For each row in df2, I was able to find the best (and second-best) matching rows in df1 in terms of Hamming distance, as described in my previous post. That post used the following example data:

set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols
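
A note on reproducibility: R 3.6.0 changed the default algorithm behind sample(), so set.seed(0) may produce different values on a newer installation than the ones printed in this post. The sketch below hard-codes the data instead. The df1 values are the ones visible in the sub_xt printout in the answer; the df2 values are an assumption, reconstructed so that they are consistent with the outputs shown, and are not stated anywhere in the post.

```r
## df1: values as they appear (transposed) in the sub_xt printout
df1 <- data.frame(V1 = c(9L, 3L, 10L, 5L, 6L),
                  V2 = c(2L, 4L, 8L, 7L, 1L))   ## 5 rows 2 cols
## df2: ASSUMPTION, reconstructed from the printed results, not given in the post
df2 <- data.frame(V1 = c(1L, 2L, 6L),
                  V2 = c(3L, 4L, 5L))            ## 3 rows 2 cols
```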

I now need to compute the number of bits equal to 1 for:

  1. each row in df2
  2. the best matching rows in df1
  3. the second-best matching rows in df1

The number of bits equal to 1 in an integer a may be computed as

sum(as.integer(intToBits(a)))
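
A minimal sketch of that expression wrapped in helpers. The names popcount and popcount_vec are hypothetical and do not appear in the original functions; the second helper assumes that intToBits() emits 32 bits per input element, which is how base R documents it.

```r
## Hypothetical helper: number of 1-bits summed over everything in a
popcount <- function(a) sum(as.integer(intToBits(a)))

popcount(5L)          ## 5 is 101 in binary, so this gives 2
popcount(7L)          ## 7 is 111 in binary, so this gives 3
popcount(c(9L, 2L))   ## 1001 and 10: the bits of both elements are pooled, giving 3

## Hypothetical helper: one count per element instead of a pooled total.
## intToBits() emits 32 bits per element, so each matrix column is one element.
popcount_vec <- function(v) colSums(matrix(as.integer(intToBits(v)), nrow = 32))

popcount_vec(c(9L, 2L))   ## 2 1
```

The pooled behaviour is what the row-wise counts in this post rely on: passing a whole row gives the total number of set bits in that row.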

I have applied this to @ZheyuanLi's original function, which gives me item 1. However, I am unable to apply the same logic to get items 2 and 3 by a simple modification of @ZheyuanLi's function.

Below are @ZheyuanLi's functions with my modification:

hmd <- function(x,y) {
    rawx <- intToBits(x)
    rawy <- intToBits(y)
    nx <- length(rawx)
    ny <- length(rawy)
    if (nx == ny) {
        ## quick return
        return (sum(as.logical(xor(rawx,rawy))))
    } else if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
    }
    if (nx %% ny) stop("unconformable length!") else {
        nc <- nx / ny  ## number of cycles
        return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
    }
}
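
For a single pair of rows, the quantity that hmd() computes can be cross-checked with base R's bitwXor(). This is only a sketch of the idea, not a replacement for hmd(); x and y are illustrative values and are not taken from the functions above.

```r
## One row from each data frame, as integer vectors (illustrative values)
x <- c(9L, 2L)
y <- c(1L, 3L)

## XOR leaves a 1 exactly where the two bit patterns differ,
## so counting the 1-bits of the XOR gives the Hamming distance.
diff_bits <- bitwXor(x, y)                     ## 8 1
hamming   <- sum(as.integer(intToBits(diff_bits)))
hamming                                        ## 2
```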

foo <- function(df1, df2, p = 2) {
    ## check p: each row of df2 can match at most nrow(df1) rows
    if (p > nrow(df1)) p <- nrow(df1)
    ## transpose for CPU cache friendly code
    xt <- t(as.matrix(df1))
    yt <- t(as.matrix(df2))
    ## after transpose, we compute hamming distance column by column
    ## a for loop is decent; no performance gain from apply family
    n <- ncol(yt)
    id <- integer(n * p)
    d <- numeric(n * p)
    sb <- integer(n)
    k <- 1:p
    for (i in 1:n) {
        set.bits <- sum(as.integer(intToBits(yt[,i])))
        distance <- hmd(xt, yt[,i])
        minp <- order(distance)[1:p]
        id[k] <- minp
        d[k] <- distance[minp]
        sb[i] <- set.bits
        k <- k + p
    }
    ## recode "id", "d" and "sb" into data frame and return
    id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
    colnames(id) <- paste0("min.", 1:p)
    d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
    colnames(d) <- paste0("mindist.", 1:p)
    sb <- as.data.frame(matrix(sb, ncol = 1))  ## no need for byrow as you have only 1 column
    colnames(sb) <- "set.bits.1"
    list(id = id, d = d, sb = sb)
}

Running these gives:

> foo(df1, df2)
$id
  min.1 min.2  ## row id for best/second best match in df1
1     1     4
2     2     3
3     5     2

$d
  mindist.1 mindist.2  ## the 2 smallest Hamming distances
1         2         2
2         1         3
3         1         3

$sb
  set.bits.1  ## number of bits equal to 1 for each row of df2
1          3
2          2
3          4

Solution

  • OK, after reading through your question while re-editing it (many times!), I think I know what you want. Essentially, nothing needs to change in hmd(). Your required items 1, 2 and 3 can all be computed after the for loop in foo().

    To get item 1, which you called sb, we could use tapply(). However, your computation of sb inside the for loop is fine, so I will not change it. Below I demonstrate the basic procedure for getting items 2 and 3.

    The id vector inside foo() stores all matching rows in df1:

    id <- c(1, 4, 2, 3, 5, 2)
    

    so we can simply extract those rows of df1 (actually, columns of xt) and compute the number of bits equal to 1 for each. As you can see, id contains many duplicates, so we only need to compute on unique(id):

    id0 <- sort(unique(id))
    ## [1] 1 2 3 4 5
    

    We now extract the corresponding columns of xt:

    sub_xt <- xt[, id0]
    ##    [,1] [,2] [,3] [,4] [,5]
    ## V1    9    3   10    5    6
    ## V2    2    4    8    7    1
    

    To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() in a vectorized fashion:

    rawbits <- as.integer(intToBits(as.numeric(sub_xt)))  ## convert sub_xt to binary
    sbxt0 <- unname(tapply(X = rawbits,
                          INDEX =  rep(1:length(id0), each = length(rawbits) / length(id0)),
                          FUN = sum))
    ## [1] 3 3 3 5 3
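
    The grouping works because every column of sub_xt contributes the same number of consecutive bits (number of rows times 32). Under that assumption, the same per-column sums can also be obtained by reshaping the bits into a matrix. The sketch below hard-codes sub_xt from the printout above; bits_per_col, via_tapply and via_colsums are illustrative names.

    ```r
    ## sub_xt as printed above (2 rows, 5 columns)
    sub_xt <- matrix(c(9L, 2L, 3L, 4L, 10L, 8L, 5L, 7L, 6L, 1L), nrow = 2)

    rawbits <- as.integer(intToBits(as.integer(sub_xt)))
    ## each column supplies nrow(sub_xt) integers of 32 bits each
    bits_per_col <- nrow(sub_xt) * 32L

    ## grouped sum, as in the answer
    group <- rep(seq_len(ncol(sub_xt)), each = bits_per_col)
    via_tapply <- as.vector(tapply(rawbits, group, sum))

    ## alternative: one matrix column per column of sub_xt, then colSums()
    via_colsums <- colSums(matrix(rawbits, nrow = bits_per_col))

    via_tapply    ## 3 3 3 5 3
    via_colsums   ## 3 3 3 5 3
    ```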
    

    Now we need to map sbxt0 (one value per unique id) back onto the full id vector, duplicates included, to get sbxt:

    sbxt <- sbxt0[match(id, id0)]
    ## [1] 3 5 3 3 3 3
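
    In this example id0 happens to be 1:5, which hides what match() is doing. A small sketch with made-up values (id_demo and counts0 are hypothetical, not from the data above) shows the general pattern of computing once per unique id and then expanding:

    ```r
    id_demo <- c(7, 2, 7, 5)            ## made-up ids with a repeat
    id0     <- sort(unique(id_demo))    ## 2 5 7
    counts0 <- c(10, 20, 30)            ## made-up results, one per element of id0

    pos    <- match(id_demo, id0)       ## 3 1 3 2: where each id sits in id0
    counts <- counts0[pos]              ## 30 10 30 20
    ```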
    

    Then we can convert sbxt to a data frame sb1, with one row per row of df2:

    sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
    colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
    ##   min.1.set.bits.1 min.2.set.bits.1
    ## 1                3                5
    ## 2                3                3
    ## 3                3                3
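
    The byrow = TRUE matters because id (and therefore sbxt) is filled p entries at a time, one block per row of df2. A quick sketch of the difference, using the sbxt values above:

    ```r
    p    <- 2
    sbxt <- c(3, 5, 3, 3, 3, 3)

    by_row <- matrix(sbxt, ncol = p, byrow = TRUE)  ## rows: (3,5), (3,3), (3,3)
    by_col <- matrix(sbxt, ncol = p)                ## rows: (3,3), (5,3), (3,3)
    ```

    Only by_row keeps the best and second-best match of the same df2 row side by side.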
    

    Finally, we can put everything together:

    foo <- function(df1, df2, p = 2) {
        ## check p: each row of df2 can match at most nrow(df1) rows
        if (p > nrow(df1)) p <- nrow(df1)
        ## transpose for CPU cache friendly code
        xt <- t(as.matrix(df1))
        yt <- t(as.matrix(df2))
        ## after transpose, we compute hamming distance column by column
        ## a for loop is decent; no performance gain from apply family
        n <- ncol(yt)
        id <- integer(n * p)
        d <- numeric(n * p)
        sb2 <- integer(n)
        k <- 1:p
        for (i in 1:n) {
            set.bits <- sum(as.integer(intToBits(yt[,i])))
            distance <- hmd(xt, yt[,i])
            minp <- order(distance)[1:p]
            id[k] <- minp
            d[k] <- distance[minp]
            sb2[i] <- set.bits
            k <- k + p
        }
        ## compute "sb1"
        id0 <- sort(unique(id))
        sub_xt <- xt[, id0]
        rawbits <- as.integer(intToBits(as.numeric(sub_xt)))  ## convert sub_xt to binary
        sbxt0 <- unname(tapply(X = rawbits,
                               INDEX =  rep(1:length(id0), each = length(rawbits) / length(id0)),
                               FUN = sum))
        sbxt <- sbxt0[match(id, id0)]
        sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
        colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
        ## recode "id", "d" and "sb2" into data frame and return
        id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
        colnames(id) <- paste0("min.", 1:p)
        d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
        colnames(d) <- paste0("mindist.", 1:p)
        sb2 <- as.data.frame(matrix(sb2, ncol = 1))  ## no need for byrow as you have only 1 column
        colnames(sb2) <- "set.bits.1"
        list(id = id, d = d, sb1 = sb1, sb2 = sb2)
    }
    

    Now, running foo(df1, df2) gives:

    > foo(df1, df2)
    $id
      min.1 min.2
    1     1     4
    2     2     3
    3     5     2
    
    $d
      mindist.1 mindist.2
    1         2         2
    2         1         3
    3         1         3
    
    $sb1
      min.1.set.bits.1 min.2.set.bits.1
    1                3                5
    2                3                3
    3                3                3
    
    $sb2
      set.bits.1
    1          3
    2          2
    3          4
    

    Note that I have renamed the sb you used to sb2.
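
    As a sanity check, the same quantities can be recomputed by brute force without hmd() or foo(). This is a slow sketch meant only for small inputs, not the method used above. The data is hard-coded: df1 follows the sub_xt printout, while the df2 values are an assumption reconstructed to be consistent with the printed results. popcount_row and dmat are illustrative names.

    ```r
    df1 <- data.frame(V1 = c(9L, 3L, 10L, 5L, 6L), V2 = c(2L, 4L, 8L, 7L, 1L))
    df2 <- data.frame(V1 = c(1L, 2L, 6L), V2 = c(3L, 4L, 5L))   ## ASSUMPTION
    p <- 2

    ## total number of 1-bits in an integer vector
    popcount_row <- function(r) sum(as.integer(intToBits(r)))

    x <- as.matrix(df1)
    y <- as.matrix(df2)

    ## dmat[i, j]: Hamming distance between row i of df2 and row j of df1
    dmat <- t(apply(y, 1, function(yr)
        apply(x, 1, function(xr) popcount_row(bitwXor(xr, yr)))))

    ## p best matches per row of df2, and the corresponding distances
    id <- matrix(apply(dmat, 1, function(v) order(v)[seq_len(p)]),
                 ncol = p, byrow = TRUE)
    d  <- matrix(apply(dmat, 1, function(v) sort(v)[seq_len(p)]),
                 ncol = p, byrow = TRUE)

    ## set bits of the matched df1 rows (sb1) and of the df2 rows (sb2)
    sb_x <- apply(x, 1, popcount_row)
    sb1  <- matrix(sb_x[t(id)], ncol = p, byrow = TRUE)
    sb2  <- apply(y, 1, popcount_row)
    ```

    With this data, id, d, sb1 and sb2 should agree with the $id, $d, $sb1 and $sb2 tables printed above.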