Search code examples
scalahamming-distance

Hamming Distance on binary Vector in Scala


I would like to have a fast implementation of Hamming distance on binary vectors. I tested it on Array[Byte] rather than Array[Int] thinking that it would be faster but it's not the case. If someone can explain me this behaviour and/or advise me on a better implementation.

def hammingDistanceI(v1:Array[Int], v2:Array[Int]) = {
  v1.zip(v2).count{case(a,b) => a!=b}
}
def hammingDistanceB(v1:Array[Byte], v2:Array[Byte]) = {
  v1.zip(v2).count{case(a,b) => a!=b} 
}

def speedMeasureByte(v:Array[Byte], nbIte:Int) = {
  val t0 = System.nanoTime
  for(i<-0 to nbIte-1) hammingDistanceB(v,v)
  val t1 = System.nanoTime
  (t1-t0)/1000000
}

def speedMeasureInt(v:Array[Int], nbIte:Int) = {
  val t0 = System.nanoTime
  for(i<-0 to nbIte-1) hammingDistanceI(v,v)
  val t1 = System.nanoTime
  (t1-t0)/1000000
}

val v1Int = Array.fill(100)(Random.nextInt(2))
val v1Byte = v1Int.map(_.toByte)

val (tInt, tByte) = (speedMeasureInt(v1Int,1000000),
                     speedMeasureByte(v1Byte,1000000))

// tInt = 1636 ms
// tByte = 3307 ms

Solution

  • I am not sure why byte implementation is slower than the other, but suspect it has to do with the way != is implemented - cpu registers are better equipped to deal with four-byte sequences nowadays than with single bytes.

    The above is just my guess though, don't bet your house on it.

    As for a faster implementation, if your use case is such, where single nanoseconds matter, you'll have to abandon the elegance of scala collections and stick with the old good loops:

     def hd(a: Array[Int], b: Array[Int]) { 
       var c = 0
       var i = 0
       while(i < a.length) { c += a(i)^b(i); i+=1 }
       c
     }
    

    This should be several hundred times faster on average than your implementation.