Tags: python, numpy, scipy, sparse-matrix

Pair-wise distances of points, with pairs specified by a sparse incidence matrix


In my code I have an array of point positions and a sparse incidence matrix. For every non-zero ij-element of the matrix, I want to calculate the distance between the i-th and the j-th point.

Currently I'm extracting the indices with nonzero(); however, this method sorts the output indices, and that sorting takes up most of the execution time of my entire application.

import numpy as np
import scipy as sc
import scipy.sparse

#number of points
n = 100000 


#point positions
pos = np.random.rand(n, 3)

#building the incidence matrix, defining the pair-wise calculation
density = 1e-3
n_entries = int(n**2 * density)
indx = np.random.randint(0, n, n_entries)
indy = np.random.randint(0, n, n_entries)

Pairs = scipy.sparse.csc_matrix((np.ones(n_entries), (indx, indy)), (n, n))



#current method
ii, jj = Pairs.nonzero() #this part is slow

d = np.linalg.norm(pos[ii] - pos[jj], axis=1)

distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))

Solution

  • Modified answer

    The main problem is that there are sometimes duplicates in indx, indy. For this reason, we cannot simply do:

    d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
    distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
    

    ...because then duplicate entries would be summed: the value stored at a duplicated index becomes the distance multiplied by the number of duplicates.
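
    To see why, here is a minimal illustration (toy data of my own, not from the question): a pair that is listed twice ends up stored with twice its true distance.

    # two points at distance 1.0, with the pair (0, 1) listed twice
    pos_toy = np.array([[0.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0]])
    ix = np.array([0, 0])
    iy = np.array([1, 1])

    d_toy = np.linalg.norm(pos_toy[ix] - pos_toy[iy], axis=1)  # [1.0, 1.0]
    m = scipy.sparse.csr_matrix((d_toy, (ix, iy)), (2, 2))
    print(m[0, 1])  # 2.0 -- duplicates were summed; the true distance is 1.0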

    The original answer below first deduplicated the indices. But even with the faster pd_unique2_idx, the whole operation ends up taking more time than the original method.

    Instead, since you already calculated Pairs (which, if you noticed, has the same duplicate-indices problem: Pairs.max() often gives more than 1.0), let us get the indices of the non-zero values without any sorting, using only what is already in the CSC structure. Here is a way to do it, following the definition of indptr and indices in the docs:

    ii = Pairs.indices
    jj = np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
    

    What you get are the same indices as your own ii, jj, only ordered differently: column-major, following the CSC storage, instead of the row-major order returned by nonzero().
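
    A quick way to convince yourself (my own check, using the Pairs matrix built above): bring both pair lists into a common order and compare.

    ii_csc = Pairs.indices
    jj_csc = np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
    ii_nz, jj_nz = Pairs.nonzero()

    # sort both pair lists row-major before comparing
    a = np.lexsort((jj_csc, ii_csc))
    b = np.lexsort((jj_nz, ii_nz))
    print(np.array_equal(ii_csc[a], ii_nz[b]) and np.array_equal(jj_csc[a], jj_nz[b]))
    # True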

    Timing

    Just looking at the part that gets ii, jj:

    %timeit Pairs.nonzero()
    # 1.27 s ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
    # 41.3 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    In context:

    def orig(pos, Pairs, n):
        #current method
        ii, jj = Pairs.nonzero() #this part is slow
        d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
        distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))
        return distance_matrix
    
    def modified(pos, Pairs, n):
        # only change: these two lines:
        ii = Pairs.indices
        jj = np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
        # unchanged
        d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
        distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))
        return distance_matrix
    

    On your data (with np.random.seed(0) to start, for repeatable results):

    %timeit orig(pos, Pairs, n)
    # 2.16 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit modified(pos, Pairs, n)
    # 1.11 s ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Correctness

    dm = orig(pos, Pairs, n)
    dmod = modified(pos, Pairs, n)
    
    np.all(np.c_[dm.nonzero()] == np.c_[dmod.nonzero()])
    # True
    
    i, j = dm.nonzero()
    np.all(dm[i, j] == dmod[i, j])
    # True
    

    Original answer

    You could calculate another list of distances in indx, indy order, then make a sparse matrix of that:

    d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
    distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
    

    However, comparison with your own distance matrix shows some rare but possible differences: they come from cases where indx, indy contain repeated locations, and the corresponding Pairs.toarray()[indx, indy] equals the number of repeats. This stems from a well-known "feature" of csc and csr sparse matrices: duplicate entries are summed when the matrix is built.

    For example:

    >>> indx = np.array([0,1,0])
    >>> indy = np.array([2,0,2])
    >>> scipy.sparse.csc_matrix((np.ones(3), (indx, indy))).toarray()
    array([[0., 0., 2.],
           [1., 0., 0.]])
    

    Unfortunately, the todok()-based solution that @hpaulj proposed in earlier posts on this topic no longer works with recent versions of scipy.
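
    If you only need correct values (rather than deduplicated indices), here is one workaround, sketched as my own suggestion rather than something from those posts: since Pairs was built from ones, its stored data are exactly the duplicate counts, and two matrices built from the same (indx, indy) arrays share the same structure and entry order, so an entry-wise division undoes the summing.

    counts = scipy.sparse.csr_matrix((np.ones(n_entries), (indx, indy)), (n, n))
    d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
    distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
    # counts.data holds how many times each (i, j) pair was repeated;
    # dividing entry-wise recovers the true distances
    distance_matrix2.data /= counts.data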

    If you prefer to deduplicate the indices themselves: using np.unique() is slow, as it sorts the values. pd.unique() is faster (no sort), but accepts only 1D arrays. Here is a fast way to do a 2D unique. Note that we need the indices of the unique rows, so that we can index indx and indy, and also the distance values if they were already calculated.

    import pandas as pd

    def pd_unique2_idx(indx, indy):
        # flag the first occurrence of each (indx, indy) row; no sorting involved
        df = pd.DataFrame(np.c_[indx, indy])
        return np.nonzero(~df.duplicated().values)


    # by contrast (slower on long sequences, because it sorts)
    def np_unique2_idx(indx, indy):
        # view each (indx, indy) row as one structured element,
        # so that np.unique treats the 2D rows as single values
        rc = np.vstack([indx, indy]).T.copy()
        dt = rc.dtype.descr * 2
        i = np.unique(rc.view(dt), return_index=True)[1]
        return i
    

    On 1M elements for both indx and indy, I see:

    • pd_unique2_idx: 52.1 ms ± 529 µs per loop
    • np_unique2_idx: 972 ms ± 19.5 ms per loop

    How to use this? Following your setup:

    i = pd_unique2_idx(indx, indy)
    indx = indx[i]
    indy = indy[i]
    d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
    distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
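
    As a quick sanity check (reusing the small example arrays from above), pd_unique2_idx keeps the first occurrence of each pair, in order of appearance; the tuple returned by np.nonzero can be used directly for indexing:

    >>> indx = np.array([0, 1, 0])
    >>> indy = np.array([2, 0, 2])
    >>> pd_unique2_idx(indx, indy)
    (array([0, 1]),)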