Search code examples
pythonperformancescipysparse-matrix

Python - Efficient Function with scipy sparse Matrices


for a project, I need an efficient function in python that solves to following task:

Given a very large List X of long sparse Vectors (=> big sparse Matrix) and another Matrix Y that contains a single Vector y, I want a List of "distances", that y has to every Element of X. Hereby the "distance" is defined like this:

Compare each Element of the two Vectors, always take the lower one and sum them up.

Example:

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

The function should return dist = [2,1,1]

In my project, both X and Y contain a lot of zeros and come in as an instance of:

<class 'scipy.sparse.csr.csr_matrix'>

So far so good and I managed to write a functions that solves this task, but is very slow and horrible inefficient. I need some tips on how to efficienty process/iterate the sparse Matrices. This is my function:

def get_distances(X, Y):
   Ret=[]
   rows, cols = X.shape  

   for i in range(0,rows):
       dist = 0                
       sample = X.getrow(i).todense()
       test = Y.getrow(0).todense()    
       rows_s, cols_s = sample.shape     
       rows_t, cols_t = test.shape 

       for s,t in zip(range(0, cols_s), range(0, cols_t)):
           dist += min(sample[0,s], test[0,t])

       X_ret.append([dist])    

   return ret

To do my Operations, I convert the sparse matrices to dense matrices which is of course horrible, but I did not know how to do it better. Do you know how to improve my code and make the function faster?

Thank you a lot!


Solution

  • I revised your function and ran it in

    import numpy as np
    from scipy import sparse
    
    def get_distances(X, Y):
       ret=[]
       for row in X:            
           sample = row.A
           test = Y.getrow(0).A   
           dist = np.minimum(sample[0,:], test[0,:]).sum()
           ret.append(dist)    
       return ret
    
    X = [[0,0,2],   
         [1,0,0],
         [3,1,0]]
    
    Y = [[1,0,2]]
    
    XM = sparse.csr_matrix(X)
    YM = sparse.csr_matrix(Y)
    
    print( get_distances(XM,YM))
    
    print (np.minimum(XM.A, YM.A).sum(axis=1))
    

    producing

    1255:~/mypy$ python3 stack37056258.py 
    [2, 1, 1]
    [2 1 1]
    

    np.minimum takes element wise minimum of two arrays (may be 2d), so I don't need to iterate on columns. I also don't need to use indexing.

    minimum is also implemented for sparse matrices, but I get a segmenation error when I try to apply it to your X (with 3 rows) and Y (with 1). If they are the same size this works:

    Ys = sparse.vstack((YM,YM,YM))
    print(Ys.shape)
    print (XM.minimum(Ys).sum(axis=1))
    

    Converting the single row matrix to an array also gets around the error - because it ends up using the dense version, np.minimum(XM.todense(), YM.A).

    print (XM.minimum(YM.A).sum(axis=1))
    

    When I try other element by element operations on these 2 matrices I get ValueError: inconsistent shapes, e.g. XM+YM, or XM<YM. Looks like sparse does not implement broadcasting as numpy arrays does.

    =======================

    Comparison of ways of replicating a 1 row sparse matrix many times

    In [271]: A=sparse.csr_matrix([0,1,0,0,1])
    
    In [272]: timeit sparse.vstack([A]*3000).A
    10 loops, best of 3: 32.3 ms per loop
    
    In [273]: timeit sparse.kron(A,np.ones((3000,1),int)).A
    1000 loops, best of 3: 1.27 ms per loop
    

    For many times, kron is better than vstack.

    =======================

    There's an overlap in issues with Scipy sparse matrix alternative for getrow()