Tags: python, numpy, cython, simd, hamming-distance

Fastest way to calculate the Hamming distance between 2 binary vectors in Python / Cython / Numpy


I am trying to calculate the Hamming distance between a binary vector and a matrix of binary vectors. The fastest approach I could find uses unsigned 8-bit integers with NumPy:

import numpy as np

# data: (n, m) array of np.uint8, one binary vector per row
np.count_nonzero(data[0] != data, axis=1)

However, the problem with this approach is that it first finds all the bits that are different and then sums the number of differences. Isn't that a bit of a waste? I tried implementing a basic version in C++ where I keep a running count of the bits that differ, so that no sum is required at the end, but it is way slower, probably because NumPy uses SIMD instructions.

So my question is: is there a way to directly calculate the Hamming distance using SIMD instructions from NumPy / Python / Cython?


Solution

Ideally, what you actually want the CPU to be doing is sum += popcount(a[i] ^ b[i]) with chunks as large as possible. For example, on x86, you'd use AVX2 to XOR 32 bytes at a time with one instruction, then a few more instructions (including vpshufb and vpaddq) to accumulate counts into SIMD vectors of per-element counts, which you horizontally sum at the end.
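In scalar form that ideal loop is only a few lines. A minimal sketch, assuming the bit-vectors are stored as whole 64-bit words (hamming_scalar is a made-up name):

#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstddef>

// Hamming distance between two bit-vectors stored as arrays of 64-bit words.
uint64_t hamming_scalar(const uint64_t* a, const uint64_t* b, size_t nwords) {
    uint64_t sum = 0;
    for (size_t i = 0; i < nwords; ++i)
        sum += std::popcount(a[i] ^ b[i]);   // one XOR + one popcount per word
    return sum;
}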

Writing the full SIMD version is easy with C++ intrinsics for a specific ISA, like x86-64.
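As a rough illustration, something like the following XORs 32 bytes with one instruction per iteration. It's a simplified sketch (hamming_avx2 is a made-up name) that extracts the lanes and uses scalar popcnt, rather than the in-register vpshufb/vpaddq accumulation described above:

#include <immintrin.h>   // AVX2 intrinsics
#include <bit>           // std::popcount (C++20)
#include <cstdint>
#include <cstddef>

// XOR 32 bytes per iteration with AVX2, then popcount the four 64-bit lanes.
// A tuned version would keep the counts in SIMD registers instead of
// extracting lanes every iteration.
uint64_t hamming_avx2(const uint8_t* a, const uint8_t* b, size_t n) {
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i x = _mm256_xor_si256(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i)),
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i)));
        alignas(32) uint64_t lanes[4];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), x);
        sum += _mm_popcnt_u64(lanes[0]) + _mm_popcnt_u64(lanes[1])
             + _mm_popcnt_u64(lanes[2]) + _mm_popcnt_u64(lanes[3]);
    }
    for (; i < n; ++i)   // handle any leftover bytes
        sum += std::popcount(static_cast<unsigned char>(a[i] ^ b[i]));
    return sum;
}

Compile with -mavx2 -mpopcnt (or -march=native) on GCC/Clang.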

You can come close with portable code using std::bitset<64> to XOR 64-bit chunks together and .count() as a portable API to an efficient popcount. Clang can auto-vectorize scalar popcount into AVX2, but GCC can't.

To construct the std::bitset<64> safely, without violating strict aliasing, you might need to memcpy each chunk from arbitrary data of another type into an unsigned long long first.
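Putting the bitset and memcpy points together, a portable version might look like this (hamming_bitset is a made-up name; the vectors are assumed to be plain byte buffers of equal length):

#include <bitset>
#include <cstdint>
#include <cstring>   // std::memcpy
#include <cstddef>

// Portable Hamming distance: memcpy each 64-bit chunk out of the byte
// buffers (no strict-aliasing violation), XOR, and let std::bitset<64>'s
// .count() provide the popcount.
uint64_t hamming_bitset(const unsigned char* a, const unsigned char* b, size_t nbytes) {
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= nbytes; i += 8) {
        unsigned long long wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        sum += std::bitset<64>(wa ^ wb).count();
    }
    for (; i < nbytes; ++i)                // handle any leftover bytes
        sum += std::bitset<8>(a[i] ^ b[i]).count();
    return sum;
}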


I don't know if NumPy has a loop for this compiled in; otherwise you might need to XOR in one pass and then popcount in another, which would suck for computational intensity, so you'd definitely want to cache-block it into small chunks that stay hot in L1d cache before you come back to re-read them.
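For what it's worth, here is that blocked two-pass structure in C++, just to make the idea concrete (hamming_blocked and the block size are my own choices; 1024 words of temporaries is 8 KiB, comfortably smaller than a typical 32 KiB L1d cache):

#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstddef>

// Two separate passes (XOR, then popcount), cache-blocked so the XOR
// results are re-read while still hot in L1d.
uint64_t hamming_blocked(const uint64_t* a, const uint64_t* b, size_t nwords) {
    constexpr size_t BLOCK = 1024;         // 8 KiB of temporaries
    uint64_t tmp[BLOCK];
    uint64_t sum = 0;
    for (size_t base = 0; base < nwords; base += BLOCK) {
        size_t len = (nwords - base < BLOCK) ? nwords - base : BLOCK;
        for (size_t i = 0; i < len; ++i)   // pass 1: XOR into a small buffer
            tmp[i] = a[base + i] ^ b[base + i];
        for (size_t i = 0; i < len; ++i)   // pass 2: popcount while still hot
            sum += std::popcount(tmp[i]);
    }
    return sum;
}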