
What is the fastest way to XOR A LOT of binary arrays in python?


I am tasked with calculating Hamming distances between 1D binary arrays in two groups - a group of 3000 arrays and a group of 10000 arrays, where every array is 100 items (bits) long. So that's 3000x10000 HD calculations on 100-bit-long objects, and all of it must be done in at most a dozen minutes.

Here's the best of what I came up with

# X - 3000 by 100 bool np.array
# Y - 10000 by 100 bool np.array
hd = []
i = 1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y))
    C = Y ^ arr  # just XOR x against all the arrays in the other group simultaneously
    hd.append([sum(c) for c in C])  # add up all the bits to get the Hamming distance
    i += 1

return np.array(hd)

And it's still going to take 1-1.5 hours to finish. How do I go about making this faster?


Solution

  • You should be able to dramatically improve the summing speed by letting numpy perform it, rather than using a list comprehension with the built-in sum function (which takes no advantage of numpy's vectorized operations).

    Just replace:

    hd.append([sum(c) for c in C])
    

    with:

    # Explicitly use uint16 to reduce memory cost; if array sizes might increase
    # you can use uint32 to leave some wiggle room
    hd.append(C.sum(1, dtype=np.uint16))
    

    which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis 1). For example:

    >>> arr = np.array([[True, False, True], [False, False, True], [True, True, True]], dtype=bool)
    >>> arr.sum(1, np.uint16)
    array([2, 1, 3], dtype=uint16)
    

    Since it performs all the work at the C layer in a single operation without type conversions, this should run substantially faster at the array scales you're describing. Your original approach requires a Python-level loop that operates on each row, plus an inner loop that, while at the C layer, must still implicitly convert each numpy value one by one from numpy bool to a Python-level int just to sum them.
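    As an illustrative, self-contained sketch (small random arrays stand in for the 3000x100 and 10000x100 inputs in the question), the whole loop with the vectorized sum looks like this. Note that numpy broadcasting also lets you XOR x directly against Y, so building the tiled arr is unnecessary:

    ```python
    import numpy as np

    # Small random stand-ins for the real data (sizes are assumptions,
    # scaled down from the question's 3000x100 and 10000x100 arrays)
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(30, 100)).astype(bool)
    Y = rng.integers(0, 2, size=(100, 100)).astype(bool)

    hd = []
    for x in X:
        C = Y ^ x  # broadcasting: x is XORed against every row of Y at once
        hd.append(C.sum(1, dtype=np.uint16))  # row sums, all at the C layer
    hd = np.array(hd)  # hd[i, j] == Hamming distance between X[i] and Y[j]
    ```

    The result has shape (len(X), len(Y)), matching what your original code builds row by row.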

    Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:

    i=1
    for x in X:
        ... rest of loop ...
        i+=1
    

    with:

    for i, x in enumerate(X, 1):
        ... rest of loop ...
    

    and you'll get the same behavior, slightly faster, more concise, and cleaner in general.
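    A minimal demonstration of enumerate with a start value of 1:

    ```python
    # enumerate(iterable, start) yields (index, item) pairs beginning at start
    letters = ['a', 'b', 'c']
    pairs = [(i, x) for i, x in enumerate(letters, 1)]
    print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
    ```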