Search code examples
pythonnumpymemorysparse-matrix

How to efficiently create a CSR row_index vector in python?


I am trying to create a CSR matrix with m rows, n columns, filled with zeroes and ones (at most one per column). I have a numpy array idcs with indices where my 1s are located, ranging from x to n.

My first approach to create the ROW_INDEX vector was something like :

ROW_INDEX=np.zeros(m+1)
for i in idcs: ROW_INDEX[i+1:]+=1

Unsurprinsingly though, this is rather slow. I then tried the good old space for speed swap :

ROW_INDEX=np.fromfunction(lambda i,j: i>idcs[j],(m+1,n),dtype='uintc')
ROW_INDEX=np.sum(ROW_INDEX,1)

However, m and n both are at 10^5 so the above code raises a MemoryError - even though the large matrix is technically only boolean.

I feel like I missed something obvious here. Anyone has a smarter solution, or should I just increase memory ?

End purpose is to create a PETSc.Mat, hopefully in parallel, starting from something like B=petsc4py.Mat().createAIJ([m, n],csr=[ROW_INDEX,COL_INDEX,V]). I've found little documentation on the subject, any help on that front would also be welcome.


Solution

  • I think you're looking for something like this?

    ROW_INDEX=np.zeros(m+1)
    np.add.at(ROW_INDEX, idcs+1, 1)
    np.cumsum(ROW_INDEX, out=ROW_INDEX)