I am tasked with calculating hamming distances between 1D binary arrays in two groups - a group of 3000 arrays and a group of 10000 arrays, and every array is 100 items(bits) long. So thats 3000x10000 HD calculations on 100 bit long objects.And all that must be done in at most a dozen minutes
Here's the best of what I came up with
#X - 3000 by 100 bool np.array
#Y - 10000 by 100 bool np.array
hd = []
i=1
for x in X:
print("object nr " + str(i) + "/" + str(len(X)))
arr = np.array([x] * len(Y))
C = Y^arr # just xor this array by all the arrays in the other group simultainously
hd.append([sum(c) for c in C]) #add up all the bits to get the hamming distance
i+=1
return np.array(hd)
And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?
You should be able to dramatically improve the summing speed by using numpy
to perform it, rather than using a list comprehension and the built-in sum
function (that takes no advantage of numpy
vectorized operations).
Just replace:
hd.append([sum(c) for c in C])
with:
# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))
which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis
1
). For example:
>>> arr = np.array([[True,False,True], [False,False,True], [True, True,True]], dtype=np.bool)
>>> arr.sum(1, np.uint16)
array([ 2, 1, 3], dtype=uint16)
Since it performs all the work at the C layer in a single operation without type conversions (instead of your original approach that requires a Python level loop that operates on each row, then an implicit loop that, while at the C layer, must still implicitly convert each numpy
value one by one from np.bool
to Python level int
s just to sum them), this should run substantially faster for the array scales you're describing.
Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate
can do that more quickly and easily. Simply replace:
i=1
for x in X:
... rest of loop ...
i+=1
with:
for i, x in enumerate(X, 1):
... rest of loop ...
and you'll get the same behavior, but slightly faster, more concise and cleaner in general.