How to efficiently make new matrix from sum of blocks from bigger sparse matrix

I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix.

For example, for a 4x4 sparse matrix A I will like to make a 2x2 matrix B in which B[i,j] = sum(A[i:i+2,j:j+2]).

Currently, I just go block by block to recreate the condensed matrix but this is slow. Any ideas on how to optimize this?

Update: Here is an example code that works fine, but is slow for a sparse matrix of 50.000x50.000 that I want to condense in a 10.000x10.000:

>>> A = (rand(4,4)<0.3)*rand(4,4)
>>> A = scipy.sparse.lil_matrix(A + A.T) # make the matrix symmetric

>>> B = scipy.sparse.lil_matrix((2,2))
>>> for i in range(B.shape[0]):
...     for j in range(B.shape[0]):
...         B[i,j] = A[i:i+2,j:j+2].sum()

Solution

First of all, lil matrix for the one your are summing up is probably really bad, I would try COO or maybe CSR/CSS (I don't know which will be better, but lil is probably inherently slower for many of these operations, even the slicing might be much slower, though I did not test). (Unless you know that for example dia fits perfectly)

Based on COO I could imagine doing some tricking around. Since COO has row and col arrays to give the exact positions:

matrix = A.tocoo()

new_row = matrix.row // 5
new_col = matrix.col // 5
bin = (matrix.shape[0] // 5) * new_col + new_row
# Now do a little dance because this is sparse,
# and most of the possible bin should not be in new_row/new_col
# also need to group the bins:
unique, bin = np.unique(bin, return_inverse=True)
sum = np.bincount(bin, weights=matrix.data)
new_col = unique // (matrix.shape[0] // 5)
new_row = unique - new_col * (matrix.shape[0] // 5)

result = scipy.sparse.coo_matrix((sum, (new_row, new_col)))

(I won't guarantee that I didn't confuse row and column somewhere and this only works for square matrices...)