Search code examples
pythonnumpyscipysparse-matrix

Keep in csr_matrix per row the 6 biggest elements


I have a Python scipy csr_matrix as follows.

A = csr_matrix(np.random.choice(a=[0, 1, 2, 3, 4], 
                                p=[0.35, 0.2, 0.15, 0.15, 0.15], 
                                size=[10, 12]))

A dense representation would be:

[[0 2 1 2 1 0 2 0 4 2 1 2]
 [0 0 1 0 2 1 0 0 0 1 4 0]
 [1 3 3 2 1 1 3 0 0 4 2 0]
 [4 0 0 0 0 1 1 2 1 3 0 3]
 [3 0 3 1 1 3 0 3 4 4 4 0]
 [0 4 0 3 0 4 4 4 0 0 3 2]
 [0 3 0 0 3 0 1 0 3 0 0 0]
 [0 2 0 1 2 0 4 1 3 2 1 0]
 [0 2 0 4 1 1 1 3 4 2 1 1]
 [0 2 3 0 3 0 4 2 3 0 4 1]]

Now, I want per row to keep the six biggest elements, and set the rest to zero. If there are multiple elements equal, It doesn't really matter which one is selected to stay and which one is set to zero, as long as there are only six non-zero elements per row. We can e.g. say that elements with lower indices stay. The expected outcome of the matrix above would then be (manually made):

[[0 2 1 2 0 0 2 0 4 2 0 2]
 [0 0 1 0 2 1 0 0 0 1 4 0]
 [1 3 3 2 0 0 3 0 0 4 2 0]
 [4 0 0 0 0 1 1 2 0 3 0 3]
 [3 0 3 0 0 3 0 0 4 4 4 0]
 [0 4 0 3 0 4 4 4 0 0 3 0]
 [0 3 0 0 3 0 1 0 3 0 0 0]
 [0 2 0 1 2 0 4 0 3 2 0 0]
 [0 2 0 4 1 0 0 3 4 2 0 0]
 [0 2 3 0 3 0 4 0 3 0 4 0]]

I can think of a way to achieve this by looping over the rows, but the real matrix in question is huge, so I would prefer to keep loops to a minimum. Does anyone have a hint how to start tackling this problem?

EDIT

The following code snippet does exactly what I want, but is terribly inefficient. Transforming the 3000 x 300 matrix in this way takes 9.3 seconds.

from timeit import timeit
import warnings

from scipy.sparse import csr_matrix, SparseEfficiencyWarning

import numpy as np


__updated__ = '2015-03-12'


def random_csr_matrix():
    np.random.seed(1)
    A = csr_matrix(np.random.choice(a=[0, 1, 2, 3, 4],
                                    p=[0.35, 0.2, 0.15, 0.15, 0.15],
                                    size=[3000, 300]))
    return A

def keep_top_N_in_csr_matrix(A, N):
    warnings.simplefilter('ignore', SparseEfficiencyWarning)

#     print ('\nBefore:')
#     print (A.todense())


    N_rows = A.shape[0]
    for i in range(N_rows):
        row = np.squeeze(np.asarray(A.getrow(i).todense()))
        A[i, :] = keep_top_N_in_np_array(row, N)

#     print ('\nAfter:')
#     print (A.todense())

def keep_top_N_in_np_array(A, N):
    assert (A >= 0).all(), 'All elements shall be nonnegative'

    for _ in range(N):
        i_max = A.argmax()
        A[i_max] = -A[i_max]
    A = np.array([-i if i < 0 else 0 for i in A])

    return A

def doit_once():
    A = random_csr_matrix()
    keep_top_N_in_csr_matrix(A, 6)

if __name__ == '__main__':
    print (timeit(doit_once, number=1))

Solution

  • This worked for me

    import numpy as np
    import numpy.random
    
    A = numpy.random.randint(5,size=(10,15))
    s = A.shape
    print A
    
    ind = np.argsort(A)[:,:s[1]-6]
    rows = np.arange(s[0])[:,None]
    
    A[rows,ind] = 0
    print A
    

    Output:

    [[4 3 4 1 1 1 1 2 2 1 1 3 0 4 3]
     [2 2 4 1 3 2 1 4 3 3 4 1 0 0 1]
     [2 3 3 0 4 0 2 3 3 1 0 1 3 3 0]
     [1 0 0 1 4 1 0 0 4 2 2 3 2 1 0]
     [4 2 1 4 4 1 3 0 3 0 2 2 3 1 0]
     [0 0 0 4 2 1 0 3 0 2 3 3 0 0 3]
     [0 2 2 4 2 3 2 3 3 2 0 4 4 1 2]
     [4 0 0 2 4 2 4 3 4 4 4 2 1 4 3]
     [1 0 2 0 0 4 4 3 3 4 2 0 2 1 0]
     [0 4 0 1 0 3 4 3 2 2 2 3 0 1 2]]
    
    [[4 3 4 0 0 0 0 0 0 0 0 3 0 4 3]
     [0 0 4 0 3 0 0 4 3 3 4 0 0 0 0]
     [0 0 3 0 4 0 0 3 3 0 0 0 3 3 0]
     [0 0 0 0 4 0 0 0 4 2 2 3 2 0 0]
     [4 0 0 4 4 0 3 0 3 0 0 0 3 0 0]
     [0 0 0 4 0 0 0 3 0 2 3 3 0 0 3]
     [0 0 0 4 0 3 0 3 3 0 0 4 4 0 0]
     [0 0 0 0 4 0 4 0 4 4 4 0 0 4 0]
     [0 0 0 0 0 4 4 3 3 4 0 0 2 0 0]
     [0 4 0 0 0 3 4 3 0 0 0 3 0 0 2]]