Tags: python, numpy, scipy, sparse-matrix

Fast splitting of array on column indices from each row of sparse array


Let's say I have a sparse array and a dense array that has the same number of columns but fewer rows:

from scipy.sparse import csr_matrix
import numpy as np

sp_arr = csr_matrix(np.array([[1,0,0,0,1],[0,0,1,0,0],[0,1,0,0,1],[0,0,0,1,1],[0,0,0,1,0]]))
arr = np.random.rand(10).reshape(2,5)
print(arr)
[[ 0.47027789  0.82510323  0.01321617  0.66264852  0.3618022 ]
 [ 0.80198907  0.36350616  0.10254934  0.65209401  0.094961  ]]

I would like to get an array containing all of the submatrices for the indices that contain values for each row of the sparse array. For example, the indices for data in sp_arr are as follows:

0: [0, 4]
1: [2]
2: [1, 4]
3: [3, 4]
4: [3]
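These can be read directly from the CSR matrix, since sp_arr[row].indices holds the column indices of that row's nonzero entries:

for row in range(sp_arr.shape[0]):
    print(row, sp_arr[row].indices.tolist())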

My output should look like this:

array([array([[ 0.47027789,  0.3618022 ],
       [ 0.80198907,  0.094961  ]]),
       array([[ 0.01321617],
       [ 0.10254934]]),
       array([[ 0.82510323,  0.3618022 ],
       [ 0.36350616,  0.094961  ]]),
       array([[ 0.66264852,  0.3618022 ],
       [ 0.65209401,  0.094961  ]]),
       array([[ 0.66264852],
       [ 0.65209401]])], dtype=object)

I can create this with the following code, but as the size of the arrays scales up (greatly, in my case) it gets really slow.

output = np.empty(sp_arr.shape[0], dtype=object)
for row in range(sp_arr.shape[0]):
    output[row] = arr[:, sp_arr[row].indices]

I've thought about vectorizing the process and applying it along an axis, but np.apply_along_axis doesn't work with sparse matrices. Unfortunately, while this example is small enough to densify and then use apply_along_axis, my actual sparse matrix is far too large to do so (>100 GB dense).

I had thought that perhaps there is some fancy indexing trick, or something like hsplit, that would accomplish this with already-vectorized methods, but so far I haven't had any luck. Is there some way to achieve this that is just escaping me?

Update

Per the answer from @Divakar, which is great, I found another way to implement the same thing with an ever-so-slight, negligible improvement.

@Divakar's best approach was:

def app2(sp_arr, arr):
    r,c = sp_arr.nonzero()
    idx = np.flatnonzero(r[1:] > r[:-1])+1    
    idx0 = np.concatenate(( [0] , idx, [r.size] ))
    arr_c = arr[:,c]
    return [arr_c[:,i:j] for i,j in zip(idx0[:-1], idx0[1:])]

This increased my performance by 50-60x! But it's kind of hard to read.

What I discovered is that, given the csr_matrix format, you can use the indices and indptr attributes to your advantage here.
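For the sample matrix above, these attributes look like this (indices holds the column indices of every nonzero entry, concatenated row by row, and indptr marks where each row's slice begins and ends):

print(sp_arr.indices)  # [0 4 2 1 4 3 4 3] - column indices, row by row
print(sp_arr.indptr)   # [0 2 3 5 7 8] - row i occupies indices[indptr[i]:indptr[i+1]]

Using those two attributes directly, the splitting becomes: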

def final_app(sp_arr, arr):
    # indptr[i]:indptr[i+1] delimits row i's slice of the indices array,
    # including an empty slice for rows with no nonzero entries
    idx = sp_arr.indptr
    arr_c = arr[:, sp_arr.indices]
    out = [arr_c[:, i:j] for i, j in zip(idx[:-1], idx[1:])]
    return out

In the end the performance is statistically the same (less than 50 ms improvement on a 276538 x 33114 sparse matrix), but it feels easier to understand. More importantly, this approach includes the rows that have no values at all, whereas the previous approach does not. That may seem unimportant, but for my use case it is pretty critical.
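To see the difference, here is a small sketch with a hypothetical 3 x 2 matrix whose middle row is empty:

sp = csr_matrix(np.array([[1, 0], [0, 0], [0, 1]]))
sub_arr = np.random.rand(2, 2)

# indptr repeats at empty rows, so row 1 still gets an (empty) slice
out = [sub_arr[:, sp.indices[i:j]] for i, j in zip(sp.indptr[:-1], sp.indptr[1:])]
print([o.shape for o in out])  # [(2, 1), (2, 0), (2, 1)] - one entry per row

# the nonzero-based split only ever sees rows that contain data
r, c = sp.nonzero()
idx = np.flatnonzero(r[1:] > r[:-1]) + 1
print([o.shape for o in np.split(sub_arr[:, c], idx, axis=1)])  # [(2, 1), (2, 1)] - row 1 is gone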

Update 2

In response to @EelcoHoogendoorn: the problem is part of a parallel implementation of the alternating least squares with regularization method that I am trying to implement in Python. It comes from the oft-cited paper Large-scale Parallel Collaborative Filtering for the Netflix Prize. The normal way of doing this is to keep distributed copies of the Ratings, User, and Item matrices across processes. I thought it might be interesting to see what would happen if we constructed all of the item submatrices up front and just distributed those to the processes. That way each process need only return the feature columns for one user or one item respectively, and those can be used to update the User and Item matrices.

The above problem was actually the bottleneck in my current implementation. And per your comment, I don't believe the transposition is critical in this situation, as part of the algorithm takes the dot product of each submatrix with its transpose.
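For context, that step looks roughly like the following sketch (lam is an illustrative regularization weight, not my actual parameters):

lam = 0.05  # illustrative regularization weight
feat_dim = arr.shape[0]
for sub in final_app(sp_arr, arr):
    if sub.shape[1] == 0:
        continue  # rows with no ratings yield empty submatrices
    # each submatrix contributes a regularized Gram matrix to a normal-equation solve
    gram = sub.dot(sub.T) + lam * sub.shape[1] * np.eye(feat_dim)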


Solution

  • Well, there are two options - np.split or loop comprehension. In my experience, I have found the latter to be faster. But the priority must be to do minimal work inside the loop comprehension, by doing as much of the pre-processing as possible beforehand.

    Approach #1 : First approach using np.split -

    # Get row, col indices
    r,c = sp_arr.nonzero()
    
    # Get the split points: positions where the row index increases.
    # r is non-decreasing for a CSR matrix, so these points cut the
    # column-indexed array arr[:,c] into one group per non-empty row.
    idx = np.flatnonzero(r[1:] > r[:-1])+1
    out = np.split(arr[:,c], idx, axis=1)
    

    Sample output -

    In [56]: [i.tolist() for i in out]
    Out[56]: 
    [[[0.47027789, 0.3618022], [0.80198907, 0.094961]],
     [[0.01321617], [0.10254934]],
     [[0.82510323, 0.3618022], [0.36350616, 0.094961]],
     [[0.66264852, 0.3618022], [0.65209401, 0.094961]],
     [[0.66264852], [0.65209401]]]
    

    Approach #2 : The second one should be better on performance; we will re-use r, c, idx from the previous method -

    idx0 = np.concatenate(( [0] , idx, [r.size] ))
    arr_c = arr[:,c]
    out = [arr_c[:,i:j] for i,j in zip(idx0[:-1], idx0[1:])]
    

    See, the loop-comprehension simply slices the already column-indexed array arr_c. That's as minimal as one could get, and as such should be good.

    Runtime test

    Approaches -

    def org_app(sp_arr, arr):
        output = np.empty(sp_arr.shape[0], dtype=object)
        for row in range(sp_arr.shape[0]):
            output[row] = arr[:, sp_arr[row].indices]
        return output
    
    def app1(sp_arr, arr):
        r,c = sp_arr.nonzero()
        idx = np.flatnonzero(r[1:] > r[:-1])+1
        return np.split(arr[:,c], idx, axis=1)
    
    def app2(sp_arr, arr):
        r,c = sp_arr.nonzero()
        idx = np.flatnonzero(r[1:] > r[:-1])+1    
        idx0 = np.concatenate(( [0] , idx, [r.size] ))
        arr_c = arr[:,c]
        return [arr_c[:,i:j] for i,j in zip(idx0[:-1], idx0[1:])]
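    Quick verification that all three approaches match on the sample input -

    ref = org_app(sp_arr, arr)
    assert all(np.array_equal(a, b) for a, b in zip(app1(sp_arr, arr), ref))
    assert all(np.array_equal(a, b) for a, b in zip(app2(sp_arr, arr), ref))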
    

    Timings -

    In [146]: sp_arr = csr_matrix((np.random.rand(100000,100)>0.8).astype(int))
         ...: arr = np.random.rand(10,sp_arr.shape[1])
         ...: 
    
    In [147]: %timeit org_app(sp_arr, arr)
         ...: %timeit app1(sp_arr, arr)
         ...: %timeit app2(sp_arr, arr)
         ...: 
    1 loops, best of 3: 5.66 s per loop
    10 loops, best of 3: 146 ms per loop
    10 loops, best of 3: 105 ms per loop