Tags: python, numpy, scipy, vectorization, sparse-matrix

Speeding up sparse cross difference in Python


I need to compute the pairwise distances between two sets of vectors, but I only need to know a small proportion of those pairwise distances. This proportion will be less than 0.01 when my two sets of vectors are large enough.

This is my current solution with a small example. A and B are the two sets of vectors, and M stores the pairs of vectors I need to retain. The number of rows in both A and B will typically be much larger (several thousand). It might also be worth mentioning that M has at least one nonzero value in each row, but some columns may only contain zeros.

import numpy as np

A = np.array([[1, 2], [2, 3], [3, 4]])                              # (3, 2)
B = np.array([[4, 5], [5, 6], [6, 7], [7, 8], [8, 9]])              # (5, 2)
M = np.array([[0, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 1]])   # (3, 5)

diff = A[:,None] - B[None,:]                                        # (3, 5, 2)
distances = np.linalg.norm(diff, ord=2, axis=2)                     # (3, 5)
masked_distances = distances * M                                    # (3, 5)

This code works, but it calculates a lot of norms that I don't need, so it's performing a lot of unnecessary computation, especially as A and B become larger. Is there a more efficient way to only compute the norms that I need? I'm open to suggestions for other packages.

I have thought about using sparse arrays to compute masked_distances, but I'm having problems with sparse arrays in scipy with more than 2 dimensions (i.e. diff).
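As an illustration of the issue: the standard scipy.sparse matrix classes are strictly 2-D, so (as far as I can tell) only the final (3, 5) result could be stored sparsely, not the (3, 5, 2) diff array. A minimal sketch, reusing distances and M from above:

import scipy.sparse

# The 2-D masked result fits in a sparse matrix with only 4 stored values,
# but there is no scipy.sparse container that could hold the 3-D diff array.
sparse_result = scipy.sparse.csr_matrix(distances * M)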

I've also tried creating a vectorised function. This also works, but it's even slower when A and B grow to have thousands of records.

def conditional_diff(a, b, m):
    if m:
        return a-b
    else:
        return 0

conditional_diff_vec = np.vectorize(conditional_diff, otypes=[float])
masked_diff = conditional_diff_vec(A[:,None], B[None,:], M[:,:,None])
masked_distances = np.linalg.norm(masked_diff, ord=2, axis=2)

Solution

  • I would solve this by using numba to construct a Compressed Sparse Row matrix. The CSR matrix allows me to avoid using memory to represent the many zeros in the matrix.

    Numba allows me to use an explicit loop without losing too much performance. This allows me to use an if to avoid computing unused distances.

    import numba as nb
    import numpy as np
    import scipy.sparse
    import math
    
    
    A_big = np.random.rand(2000, 10)
    B_big = np.random.rand(4000, 10)
    M_big = np.random.rand(A_big.shape[0], B_big.shape[0]) < 0.001
    
    
    @nb.njit()
    def euclidean_distance(vec_a, vec_b):
        acc = 0.0
        for i in range(vec_a.shape[0]):
            acc += (vec_a[i] - vec_b[i]) ** 2
        return math.sqrt(acc)
    
    
    @nb.njit()
    def masked_distance_inner(data, indices, indptr, matrix_a, matrix_b, mask):
        write_pos = 0
        N, M = matrix_a.shape[0], matrix_b.shape[0]
        for i in range(N):
            for j in range(M):
                if mask[i, j]:
                    # Record distance value
                    data[write_pos] = euclidean_distance(matrix_a[i], matrix_b[j])
                    # Record what column this distance value should be in
                    indices[write_pos] = j
                    write_pos += 1
            # Record how many entries are in this row
            indptr[i + 1] = write_pos
        # Assert that we wrote to every part of un-initialized memory
        assert write_pos == data.shape[0]
        assert write_pos == indices.shape[0]
        # Data is returned by modifying parameters to function
    
    
    def masked_distance(matrix_a, matrix_b, mask):
        N, M = matrix_a.shape[0], matrix_b.shape[0]
        assert mask.shape == (N, M)
        # Convert mask to bool
        mask = mask != 0
        # How many entries will this sparse array have?
        sparse_length = mask.sum()
        # Set up CSR matrix
        # data and indices do not need to be zero-initialized,
        # and it is faster not to initialize them
        # Holds data values
        data = np.empty(sparse_length, dtype='float64')
        # Holds column that each data value is in
        indices = np.empty(sparse_length, dtype='int64')
        # Holds the position of the elements of each row within data and indices
        indptr = np.zeros(N + 1, dtype='int64')
        masked_distance_inner(data, indices, indptr, matrix_a, matrix_b, mask)
        return scipy.sparse.csr_matrix((data, indices, indptr), shape=(N, M))
    
    
    %timeit masked_distance(A_big, B_big, M_big)
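
    To make the output format concrete, here is roughly what the three CSR arrays contain when the same function is applied to the small example from the question (a sketch; the values are rounded):

    A = np.array([[1, 2], [2, 3], [3, 4]])
    B = np.array([[4, 5], [5, 6], [6, 7], [7, 8], [8, 9]])
    M = np.array([[0, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 1]])

    result = masked_distance(A, B, M)
    print(result.data)       # approx. [8.485 2.828 4.243 7.071] -- only the four required distances
    print(result.indices)    # [3 0 1 4] -- the column of each stored distance
    print(result.indptr)     # [0 1 3 4] -- row i's entries are data[indptr[i]:indptr[i+1]]
    print(result.toarray())  # dense (3, 5) view, matching the question's masked_distances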
    

    A few notes:

    • I found that calculating the Euclidean distance in numba was faster than np.linalg.norm(A[i] - B[i]), and it still returned the same result. That's why the euclidean_distance() function is present.

    • masked_distance() is responsible for setting up the algorithm. masked_distance_inner() is all of the speed-critical stuff.

    • I benchmarked this versus the algorithm in the question, and found that for inputs A with shape (2000, 10) and B with shape (4000, 10), with 0.1% of the elements of M set to True, it was roughly 40x faster:

      13.5 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
      556 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      

      The exact speedup will depend on the sizes of the matrices involved and how sparse the mask is. For example, I benchmarked the same problem with vectors of length 1000, and found a 1000x speedup.

    • I checked the correctness of this algorithm against the original using np.allclose(); a sketch of that check is shown after these notes.

    • You could probably make this faster by using smaller dtypes. You could replace float64 with float32 if you only need 7 digits of accuracy for the distances. You could replace int64 with int32 if you know that the dimensions of the matrix are smaller than 2^31 and that there will never be more than 2^31 defined elements.
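
    For completeness, the correctness check mentioned in the notes can be done along these lines (a sketch; dense_masked_distances is just a hypothetical helper wrapping the dense approach from the question, and the final line times that dense baseline for comparison):

    def dense_masked_distances(matrix_a, matrix_b, mask):
        # The original dense approach from the question
        diff = matrix_a[:, None] - matrix_b[None, :]
        distances = np.linalg.norm(diff, ord=2, axis=2)
        return distances * mask

    reference = dense_masked_distances(A_big, B_big, M_big)
    sparse_result = masked_distance(A_big, B_big, M_big)

    # The densified sparse result should match the dense reference everywhere
    assert np.allclose(reference, sparse_result.toarray())

    %timeit dense_masked_distances(A_big, B_big, M_big)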