python, python-3.x, numpy, hamming-distance

Pairwise Hamming distance between NumPy arrays, considering non-zero values only


I want to calculate the pairwise Hamming distance between the rows of a 2D NumPy array.

My array is

A
array([[-1,  0, -1,  0, -1,  0],
       [ 1,  0,  0,  0,  0,  0],
       [ 0,  0,  1,  1,  1,  0],
       [ 0,  0, -1,  1,  0,  0],
       [ 0,  0,  0,  0, -1,  0]], dtype=int8)

I want to calculate the Hamming distance between each pair of rows of A, but considering only non-zero values. If either entry at a position is zero, that position is not included in the calculation.

My output should be

B
array([[0, 1, 2, 0, 0],
       [1, 0, 0, 0, 0],
       [2, 0, 0, 1, 1],
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0]], dtype=int8) 
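
To make the rule concrete, here is a plain-loop sketch of the distance described above. The helper name `masked_hamming_loops` is made up for illustration, and the code is meant only as a slow reference for checking faster versions, not as a recommended implementation.

```python
import numpy as np

A = np.array([[-1, 0, -1, 0, -1, 0],
              [ 1, 0,  0, 0,  0, 0],
              [ 0, 0,  1, 1,  1, 0],
              [ 0, 0, -1, 1,  0, 0],
              [ 0, 0,  0, 0, -1, 0]], dtype=np.int8)


def masked_hamming_loops(arr):
    """Hypothetical reference helper: pairwise distance with explicit loops."""
    n = arr.shape[0]
    out = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        for j in range(n):
            for a, b in zip(arr[i], arr[j]):
                # A position counts only if both entries are non-zero and differ.
                if a != 0 and b != 0 and a != b:
                    out[i, j] += 1
    return out


B = masked_hamming_loops(A)
print(B)
```

The result should match the expected matrix B shown above; any vectorized answer can be compared against it with `np.array_equal`.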

Solution

  • If your array contains only -1, 0, and +1, you can use the following property: for two rows r1 and r2, the product r1 * r2 contains 0 wherever either entry is missing (zero), -1 where the elements differ, and +1 where they are the same. You therefore want to multiply every pair of rows together and count the number of negative entries for each pair.

    You can take all of the pairwise products at once with broadcasting:

    B = np.count_nonzero(A[:, None, :] * A[None, :, :] < 0, axis=-1)
    

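    If you need to generalize to values other than -1 and +1, you can use a similar trick to check for inequality explicitly. For two items a and b, the quantity a * b * (a - b) is non-zero if and only if both items are non-zero and different. This holds for exact arithmetic; with a narrow integer dtype such as int8, the product can overflow and wrap around (possibly to zero) for larger values, so cast to a wider type first in that case: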

    A1 = A[:, None, :]
    A2 = A[None, :, :]
    B = np.count_nonzero(A1 * A2 * (A1 - A2), axis=-1)
    

    If you want to write the condition out explicitly, which also avoids doing any arithmetic on the values, you can do:

    np.count_nonzero((A1 != A2) & (A1 != 0) & (A2 != 0), axis=-1)
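
As a quick sanity check, the three expressions above can be run side by side on the array from the question. This is only a sketch: the names `B_sign`, `B_prod`, `B_mask`, `W1`, and `W2` are illustrative, and the cast to `int64` is an added precaution (not part of the original snippets) so that the product cannot wrap around in `int8`.

```python
import numpy as np

A = np.array([[-1, 0, -1, 0, -1, 0],
              [ 1, 0,  0, 0,  0, 0],
              [ 0, 0,  1, 1,  1, 0],
              [ 0, 0, -1, 1,  0, 0],
              [ 0, 0,  0, 0, -1, 0]], dtype=np.int8)

A1 = A[:, None, :]  # shape (n, 1, m)
A2 = A[None, :, :]  # shape (1, n, m)

# Sign trick: only valid when every entry is -1, 0, or +1.
B_sign = np.count_nonzero(A1 * A2 < 0, axis=-1)

# Product trick on widened copies, so a * b * (a - b) cannot overflow int8.
W1 = A1.astype(np.int64)
W2 = A2.astype(np.int64)
B_prod = np.count_nonzero(W1 * W2 * (W1 - W2), axis=-1)

# Explicit condition: both entries non-zero and different.
B_mask = np.count_nonzero((A1 != A2) & (A1 != 0) & (A2 != 0), axis=-1)

expected = np.array([[0, 1, 2, 0, 0],
                     [1, 0, 0, 0, 0],
                     [2, 0, 0, 1, 1],
                     [0, 0, 1, 0, 0],
                     [0, 0, 1, 0, 0]])
print(np.array_equal(B_sign, expected),
      np.array_equal(B_prod, expected),
      np.array_equal(B_mask, expected))

# Why the widening matters: in int8, 16 * 8 = 128 wraps to -128,
# and -128 * 8 wraps to 0, so the pair (16, 8) would be missed.
a = np.array([16], dtype=np.int8)
b = np.array([8], dtype=np.int8)
wrapped = a * b * (a - b)
wide = a.astype(np.int64) * b.astype(np.int64) * (a.astype(np.int64) - b.astype(np.int64))
print(np.count_nonzero(wrapped), np.count_nonzero(wide))
```

All three variants build an intermediate array of shape (n, n, m), so memory grows quadratically with the number of rows. For large inputs, processing the rows in chunks is one way to keep that under control.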