I have a Python scipy
csr_matrix
as follows.
A = csr_matrix(np.random.choice(a=[0, 1, 2, 3, 4],
p=[0.35, 0.2, 0.15, 0.15, 0.15],
size=[10, 12]))
A dense representation would be:
[[0 2 1 2 1 0 2 0 4 2 1 2]
[0 0 1 0 2 1 0 0 0 1 4 0]
[1 3 3 2 1 1 3 0 0 4 2 0]
[4 0 0 0 0 1 1 2 1 3 0 3]
[3 0 3 1 1 3 0 3 4 4 4 0]
[0 4 0 3 0 4 4 4 0 0 3 2]
[0 3 0 0 3 0 1 0 3 0 0 0]
[0 2 0 1 2 0 4 1 3 2 1 0]
[0 2 0 4 1 1 1 3 4 2 1 1]
[0 2 3 0 3 0 4 2 3 0 4 1]]
Now, I want per row to keep the six biggest elements, and set the rest to zero. If there are multiple elements equal, It doesn't really matter which one is selected to stay and which one is set to zero, as long as there are only six non-zero elements per row. We can e.g. say that elements with lower indices stay. The expected outcome of the matrix above would then be (manually made):
[[0 2 1 2 0 0 2 0 4 2 0 2]
[0 0 1 0 2 1 0 0 0 1 4 0]
[1 3 3 2 0 0 3 0 0 4 2 0]
[4 0 0 0 0 1 1 2 0 3 0 3]
[3 0 3 0 0 3 0 0 4 4 4 0]
[0 4 0 3 0 4 4 4 0 0 3 0]
[0 3 0 0 3 0 1 0 3 0 0 0]
[0 2 0 1 2 0 4 0 3 2 0 0]
[0 2 0 4 1 0 0 3 4 2 0 0]
[0 2 3 0 3 0 4 0 3 0 4 0]]
I can think of a way to achieve this by looping over the rows, but the real matrix in question is huge, so I would prefer to keep loops to a minimum. Does anyone have a hint how to start tackling this problem?
EDIT
The following code snippet does exactly what I want, but is terribly inefficient. Transforming the 3000 x 300 matrix in this way takes 9.3 seconds.
from timeit import timeit
import warnings
from scipy.sparse import csr_matrix, SparseEfficiencyWarning
import numpy as np
__updated__ = '2015-03-12'
def random_csr_matrix():
np.random.seed(1)
A = csr_matrix(np.random.choice(a=[0, 1, 2, 3, 4],
p=[0.35, 0.2, 0.15, 0.15, 0.15],
size=[3000, 300]))
return A
def keep_top_N_in_csr_matrix(A, N):
warnings.simplefilter('ignore', SparseEfficiencyWarning)
# print ('\nBefore:')
# print (A.todense())
N_rows = A.shape[0]
for i in range(N_rows):
row = np.squeeze(np.asarray(A.getrow(i).todense()))
A[i, :] = keep_top_N_in_np_array(row, N)
# print ('\nAfter:')
# print (A.todense())
def keep_top_N_in_np_array(A, N):
assert (A >= 0).all(), 'All elements shall be nonnegative'
for _ in range(N):
i_max = A.argmax()
A[i_max] = -A[i_max]
A = np.array([-i if i < 0 else 0 for i in A])
return A
def doit_once():
A = random_csr_matrix()
keep_top_N_in_csr_matrix(A, 6)
if __name__ == '__main__':
print (timeit(doit_once, number=1))
This worked for me
import numpy as np
import numpy.random
A = numpy.random.randint(5,size=(10,15))
s = A.shape
print A
ind = np.argsort(A)[:,:s[1]-6]
rows = np.arange(s[0])[:,None]
A[rows,ind] = 0
print A
Output:
[[4 3 4 1 1 1 1 2 2 1 1 3 0 4 3]
[2 2 4 1 3 2 1 4 3 3 4 1 0 0 1]
[2 3 3 0 4 0 2 3 3 1 0 1 3 3 0]
[1 0 0 1 4 1 0 0 4 2 2 3 2 1 0]
[4 2 1 4 4 1 3 0 3 0 2 2 3 1 0]
[0 0 0 4 2 1 0 3 0 2 3 3 0 0 3]
[0 2 2 4 2 3 2 3 3 2 0 4 4 1 2]
[4 0 0 2 4 2 4 3 4 4 4 2 1 4 3]
[1 0 2 0 0 4 4 3 3 4 2 0 2 1 0]
[0 4 0 1 0 3 4 3 2 2 2 3 0 1 2]]
[[4 3 4 0 0 0 0 0 0 0 0 3 0 4 3]
[0 0 4 0 3 0 0 4 3 3 4 0 0 0 0]
[0 0 3 0 4 0 0 3 3 0 0 0 3 3 0]
[0 0 0 0 4 0 0 0 4 2 2 3 2 0 0]
[4 0 0 4 4 0 3 0 3 0 0 0 3 0 0]
[0 0 0 4 0 0 0 3 0 2 3 3 0 0 3]
[0 0 0 4 0 3 0 3 3 0 0 4 4 0 0]
[0 0 0 0 4 0 4 0 4 4 4 0 0 4 0]
[0 0 0 0 0 4 4 3 3 4 0 0 2 0 0]
[0 4 0 0 0 3 4 3 0 0 0 3 0 0 2]]