
How to sum the rows of a sparse matrix into the first column and zero the other columns, keeping the dimensions of the original matrix?


I have a sparse matrix B. I want to build a sparse matrix A from B with the same shape, where the first column holds the sum of each row of B divided by 2 and every other column is zero.

from numpy import array
from scipy.sparse import csr_matrix

row = array([0,0,1,2,2,2])
col = array([0,2,2,0,1,2])
data = array([1,2,3,4,5,6])

B = csr_matrix( (data,(row,col)), shape=(3,3) )

A = B.copy()

A = A.sum(axis=1)/2
# A becomes a dense 3 x 1 matrix here instead of a sparse 3 x 3 one!

Solution

  • I think this could be approached in several ways. Yours is fine.

    In [275]: import numpy as np
         ...: from scipy.sparse import csr_matrix
         ...:
         ...: row = np.array([0,0,1,2,2,2])
         ...: col = np.array([0,2,2,0,1,2])
         ...: data = np.array([1,2,3,4,5,6.])    # float, so the halved sums are not truncated
         ...:
         ...: B = csr_matrix( (data,(row,col)), shape=(3,3) )
    In [276]: A = B.copy()                                                                                 
    In [277]: A                                                                                            
    Out[277]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>
    

    Assigning to the first column and zeroing the rest works, although each step raises an efficiency warning:

    In [278]: A[:,0]  = A.sum(axis=1)/2                                                                    
    /usr/local/lib/python3.6/dist-packages/scipy/sparse/_index.py:126: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
      self._set_arrayXarray(i, j, x)
    In [279]: A[:,1:] = 0                                                                                  
    /usr/local/lib/python3.6/dist-packages/scipy/sparse/_index.py:126: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
      self._set_arrayXarray(i, j, x)
    In [280]: A                                                                                            
    Out[280]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 9 stored elements in Compressed Sparse Row format>
    
    In [283]: A.eliminate_zeros()                                                                          
    In [284]: A                                                                                            
    Out[284]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    In [285]: A.A                                                                                          
    Out[285]: 
    array([[1.5, 0. , 0. ],
           [1.5, 0. , 0. ],
           [7.5, 0. , 0. ]])
    

    The efficiency warning is mainly intended to discourage iterative or repeated assignments. I think that for a one-time action like this it can be ignored.
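
    If the warning is unwanted, one option is to do the assignment in LIL format, as the warning text suggests, and convert to CSR afterwards. The following is a minimal sketch under that assumption, not a benchmarked recommendation; the name A_lil is only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6.])
B = csr_matrix((data, (row, col)), shape=(3, 3))

# Start from an empty LIL matrix, which is meant for changing the sparsity structure.
A_lil = lil_matrix(B.shape, dtype=B.dtype)

# B.sum(axis=1) is a dense (3, 1) matrix of row sums; halve it and store it in column 0.
A_lil[:, 0] = B.sum(axis=1) / 2

# Convert back to CSR for arithmetic and matrix products.
A = A_lil.tocsr()
print(A.toarray())
```

    Because A_lil starts empty, the other columns never need to be zeroed and eliminate_zeros() is not required.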

    Or if we start with an all-zero A:

    In [286]: A = csr_matrix(np.zeros(B.shape))   # csr_matrix(B.shape) would avoid building a dense array
    In [287]: A[:,0]  = B.sum(axis=1)/2                                                                    
    /usr/local/lib/python3.6/dist-packages/scipy/sparse/_index.py:126: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
      self._set_arrayXarray(i, j, x)
    In [288]: A                                                                                            
    Out[288]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    

    Alternatively, the column matrix of row sums can be used to construct A directly, with the same style of inputs used to define B:

    In [289]: A1  = B.sum(axis=1)/2                                                                        
    In [290]: A1                                                                                           
    Out[290]: 
    matrix([[1.5],
            [1.5],
            [7.5]])
    In [296]: row = np.arange(3)                                                                           
    In [297]: col = np.zeros(3,int)                                                                        
    In [298]: data = A1.A1                                                                                 
    In [299]: A = csr_matrix((data, (row, col)), shape=(3,3))                                              
    In [301]: A                                                                                            
    Out[301]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    In [302]: A.A                                                                                          
    Out[302]: 
    array([[1.5, 0. , 0. ],
           [1.5, 0. , 0. ],
           [7.5, 0. , 0. ]])
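
    The direct construction can be wrapped in a small function that works for any shape. This is a sketch only: the function name half_row_sums_in_first_column is hypothetical, and dropping rows whose sum is zero is an added choice so that no explicit zeros are stored.

```python
import numpy as np
from scipy.sparse import csr_matrix


def half_row_sums_in_first_column(B):
    """Return a CSR matrix shaped like B whose first column is the row sums of B divided by 2."""
    # Row sums come back as a dense (n_rows, 1) matrix; flatten to a 1-D array.
    half_sums = np.asarray(B.sum(axis=1)).ravel() / 2

    # Keep only rows with a nonzero result so that no explicit zeros are stored.
    keep = half_sums != 0
    rows = np.flatnonzero(keep)
    cols = np.zeros(rows.size, dtype=int)  # every stored value goes in column 0

    return csr_matrix((half_sums[keep], (rows, cols)), shape=B.shape)


row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6.])
B = csr_matrix((data, (row, col)), shape=(3, 3))

A = half_row_sums_in_first_column(B)
print(A.toarray())
```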
    

    I don't know which approach is fastest. Your sparse.hstack looks nice, though under the covers hstack builds the row, col and data arrays from the coo formats of its inputs and makes a new coo_matrix. It is reliable, but not particularly streamlined.
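
    For comparison, here is one way an hstack version could look. The hstack code being referred to is not reproduced here, so this form is an assumption; first_col and padding are illustrative names.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6.])
B = csr_matrix((data, (row, col)), shape=(3, 3))

# Halved row sums as a sparse (n_rows, 1) column.
first_col = csr_matrix(B.sum(axis=1) / 2)

# An all-zero sparse block for the remaining columns.
padding = csr_matrix((B.shape[0], B.shape[1] - 1))

# Stack the two blocks side by side and ask for CSR output.
A_h = sparse.hstack([first_col, padding], format="csr")
print(A_h.toarray())
```

    Timing this against the other variants with timeit on matrices of realistic size would be the way to settle which is fastest.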