Search code examples
pythonscipysparse-matrixmatrix-indexing

Why doesn't scipy.sparse.csc_matrix preserve the indexing order of my np.array?


I am writing code to remove multiple columns from several large, parallel scipy sparse.csc matrices (meaning all matrices have the same dim, and all nnz elements are in the same places) simultaneously and efficiently. I am doing this by indexing to only the columns I want to keep for one matrix and then reusing the indices and indptr lists for the others. However, when I index the csc matrix by a list, it reorders the data list, so I cannot reuse the indices. Is there a way to force scipy to keep the data list in the original order? Why is it reordering only when indexing by a list?

import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5], 
                                        [1,0,1,0,0,0], 
                                        [0,0,0,4,0,1],
                                        [0,3,0,1,0,4]]))
print mat[:,3].data

returns array([4, 1])

print mat[:,[3]].data

returns array([1, 4])


Solution

  • In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5],[1,0,1,0,0,0],[0,0,0,4,
        ...: 0,1],[0,3,0,1,0,4]])) 
        ...:  
        ...:                                                                        
    In [44]: mat                                                                    
    Out[44]: 
    <4x6 sparse matrix of type '<class 'numpy.int64'>'
        with 10 stored elements in Compressed Sparse Column format>
    In [45]: mat.data                                                               
    Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
    In [46]: mat.indices                                                            
    Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
    In [47]: mat.indptr                                                             
    Out[47]: array([ 0,  2,  3,  4,  6,  7, 10], dtype=int32)
    

    scalar selection:

    In [48]: m1 = mat[:,3]                                                          
    In [49]: m1                                                                     
    Out[49]: 
    <4x1 sparse matrix of type '<class 'numpy.int64'>'
        with 2 stored elements in Compressed Sparse Column format>
    In [50]: m1.data                                                                
    Out[50]: array([4, 1])
    In [51]: m1.indices                                                             
    Out[51]: array([2, 3], dtype=int32)
    In [52]: m1.indptr                                                              
    Out[52]: array([0, 2], dtype=int32)
    

    list indexing:

    In [53]: m2 = mat[:,[3]]                                                        
    In [54]: m2.data                                                                
    Out[54]: array([1, 4], dtype=int64)
    In [55]: m2.indices                                                             
    Out[55]: array([3, 2], dtype=int32)
    In [56]: m2.indptr                                                              
    Out[56]: array([0, 2], dtype=int32)
    

    sorting:

    In [57]: m2.sort_indices()                                                      
    In [58]: m2.data                                                                
    Out[58]: array([4, 1], dtype=int64)
    In [59]: m2.indices                                                             
    Out[59]: array([2, 3], dtype=int32)
    

    csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.

    csc matrices have a method to ensure the indicies values are ordered (within a column). Applying that might help to ensure the arrays are sorted in the same way.