I am currently working on a nonlocal method for image denoising and was asked to build a kernel, let's call it w, with one entry for each pair of pixels of an image. For context, if an image has N×M pixels, then the kernel (stored as a matrix) will have (NM)^2 elements. Luckily, the matrix is sparse and symmetric, and its diagonal is known. So I designed a function win_largo(i, l), where i is the index of the i-th row of w and l is the number of elements I want to compute; it returns a CSR sparse row and its transpose.
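For concreteness, a minimal stub with this interface might look like the sketch below; the exponential weights are a placeholder, since the actual computation depends on the nonlocal method:

import numpy as np
from scipy import sparse

def win_largo(i, l):
    # Hypothetical sketch of the interface only: compute the l weights
    # w[i, i-l:i] of row i and return them as a 1 x l CSR row together
    # with its transpose. The decay below is a placeholder, not the
    # actual nonlocal weight computation.
    data = np.exp(-np.linspace(0.0, 5.0, l))
    row = sparse.csr_matrix(data.reshape(1, l))
    return row, row.T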
I have an image of 236^2 pixels, so in my case l = 1500 captures enough information, and I can compute the first 1500 rows and columns efficiently. However, there are 54,196 rows left. To compute them, I used the multiprocessing module like this:
import multiprocessing as mp

nm = N * M
pool = mp.Pool(processes=15)
results = [pool.apply_async(win_largo, args=(i, 1500)) for i in xrange(1500, nm)]
pool.close()
pool.join()
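As an aside, the same results could be collected with Pool.map and a small module-level wrapper; this is just a sketch of an equivalent alternative (note that map returns the values directly, so no .get() is needed later):

def win_largo_1500(i):
    # wrapper with l fixed at 1500, defined at module level so the
    # pool workers can pickle it
    return win_largo(i, 1500)

pool = mp.Pool(processes=15)
rows = pool.map(win_largo_1500, xrange(1500, nm))
pool.close()
pool.join()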
This gives a pretty quick calculation (a couple of minutes) of something that would otherwise take around 6.45 hours to compute serially. However, now I need to save the result.
To do so, I wrote the following loop, where w is again stored as a CSR matrix:
j = 0
for i in xrange(1500, nm):
    w[i, i-l:i], w[i-l:i, i] = results[j].get()
    j += 1
But it takes around 6.52 hours to complete. Is there a faster way to do this? As it stands, the multiprocessing speedup seems useless.
I am using Python 2.7 on macOS High Sierra.
I found a way.
First of all, slicing isn't the way to go. I tested the following:
%%timeit -n 2
ws = sparse.csr_matrix((nm, nm), dtype=np.float)
for i in xrange(5925, 6000):  # 75 iterations
    ws[i, i-l:i], ws[i-l:i, i] = win_largo(i, l)
I got
2 loops, best of 3: 2.02 s per loop
Meanwhile, win_largo itself takes only a couple of milliseconds to compute.
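This matches SciPy's documented behavior: writing into a csr_matrix changes its compressed sparsity structure, which is expensive and raises a SparseEfficiencyWarning. A small standalone demonstration:

import warnings
import numpy as np
from scipy import sparse

ws = sparse.csr_matrix((10, 10), dtype=np.float64)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ws[0, 1] = 1.0  # inserting a new nonzero rebuilds the CSR structure
print(caught[0].category.__name__)  # SparseEfficiencyWarning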
Since slicing isn't a good way to assign the computed objects, I decided to take advantage of the structure of the sparse slices returned by win_largo. I modified the function so that it returns the set of indices at which the row is nonzero, together with the corresponding data:
%%timeit -n 5
row_ind, col_ind, datos = [], [], []
for i in xrange(5500, 6000):  # 500 iterations
    indx_w, datos_w = win_largo(i, l)
    row_ind.extend(np.repeat(i, len(indx_w)).tolist())
    col_ind.extend(np.add(i - l, indx_w).tolist())
    datos.extend(datos_w.tolist())
What I got left me speechless:
5 loops, best of 3: 2.02 s per loop
With this approach, the whole loop took 3 min 32 s to complete: way less than 6 hours, approximately 0.92% of the original time. That's a huge improvement.
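For reference, the modified win_largo only needs to expose the nonzero offsets within the window and their values; a sketch of that interface (the weights and the threshold are placeholders):

import numpy as np

def win_largo(i, l):
    # Sketch of the modified interface: return the offsets within the
    # window [i-l, i) where row i is nonzero, plus the matching values.
    vals = np.exp(-np.linspace(0.0, 5.0, l))  # placeholder weights
    indx_w = np.nonzero(vals > 1e-8)[0]       # offsets relative to i - l
    datos_w = vals[indx_w]
    return indx_w, datos_w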
After this, I was able to add some extra data to each list, so that the resulting sparse matrix can be built as:
ws = sparse.csr_matrix((datos, (row_ind, col_ind)), shape=(nm, nm))
This just took a couple of seconds. In total, I was able to get the matrix in about 4 minutes :D
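The "extra data" would presumably be the symmetric entries and the known diagonal mentioned at the start; a sketch of how that could look (diag_w is a hypothetical name for the array holding the known diagonal):

# inside the loop, mirror each triplet so that w stays symmetric
row_ind.extend(np.add(i - l, indx_w).tolist())      # swap row and column
col_ind.extend(np.repeat(i, len(indx_w)).tolist())
datos.extend(datos_w.tolist())

# after the loop, append the known diagonal
row_ind.extend(range(nm))
col_ind.extend(range(nm))
datos.extend(diag_w.tolist())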