I would like to have a fast implementation of Hamming distance on binary vectors.
I tested it on Array[Byte]
rather than Array[Int]
thinking that it would be faster but it's not the case.
If someone can explain me this behaviour and/or advise me on a better implementation.
def hammingDistanceI(v1:Array[Int], v2:Array[Int]) = {
v1.zip(v2).count{case(a,b) => a!=b}
}
def hammingDistanceB(v1:Array[Byte], v2:Array[Byte]) = {
v1.zip(v2).count{case(a,b) => a!=b}
}
def speedMeasureByte(v:Array[Byte], nbIte:Int) = {
val t0 = System.nanoTime
for(i<-0 to nbIte-1) hammingDistanceB(v,v)
val t1 = System.nanoTime
(t1-t0)/1000000
}
def speedMeasureInt(v:Array[Int], nbIte:Int) = {
val t0 = System.nanoTime
for(i<-0 to nbIte-1) hammingDistanceI(v,v)
val t1 = System.nanoTime
(t1-t0)/1000000
}
val v1Int = Array.fill(100)(Random.nextInt(2))
val v1Byte = v1Int.map(_.toByte)
val (tInt, tByte) = (speedMeasureInt(v1Int,1000000),
speedMeasureByte(v1Byte,1000000))
// tInt = 1636 ms
// tByte = 3307 ms
I am not sure why byte implementation is slower than the other, but suspect it has to do with the way !=
is implemented - cpu registers are better equipped to deal with four-byte sequences nowadays than with single bytes.
The above is just my guess though, don't bet your house on it.
As for a faster implementation, if your use case is such, where single nanoseconds matter, you'll have to abandon the elegance of scala collections and stick with the old good loops:
def hd(a: Array[Int], b: Array[Int]) {
var c = 0
var i = 0
while(i < a.length) { c += a(i)^b(i); i+=1 }
c
}
This should be several hundred times faster on average than your implementation.