
joblib.Parallel processes the same set of data multiple times instead of different sets


I have a matrix of 3D brain images on which I am doing some processing.

The input matrix looks like M[X, Y], where X is the brain id and Y is the data, which I reshape later in order to enhance it.

The following sequential code does this correctly:

import numpy as np

def transform(X):
    # Restore each flattened row to a 176 x 208 x 176 volume.
    data = np.reshape(X, (-1, 176, 208, 176))
    # Crop every volume to 90 x 100 x 70.
    data_cropped = np.empty((data.shape[0], 90, 100, 70))
    for idx in range(data.shape[0]):
        data_cropped[idx, :, :, :] = data[idx, 40:130, 40:140, 50:120]

    data_cropped = perm(data_cropped)  # perm() is a helper defined elsewhere
    #data_cropped = impute_data(data_cropped)
    # Flatten each cropped volume back into a single row.
    data_cropped = np.reshape(data_cropped, (data_cropped.shape[0], -1))
    #data_cropped = data_cropped[:, np.apply_along_axis(np.count_nonzero, 0, data_cropped) != 0]

    return data_cropped


X_train = np.load("./data_original/X_train.npy")
X_crop = transform(X_train)

The output of this code portion when running sequentially (normal for loop) is:

brain: 0

brain: 1

brain: 2

brain: 3

...

The problem is that it takes a very long time (around 60 min) to process all the brains.

I tried to make the code run in parallel, but I am unable to process all the brains: only brain 0 is processed, multiple times.

Here is my attempt to parallelize the code:

import multiprocessing
from joblib import Parallel, delayed

num_cores = multiprocessing.cpu_count()
X_train = np.load("./data_original/X_train.npy")
X_crop = Parallel(n_jobs=num_cores)(delayed(transform)(i) for i in X_train)

But I got this result:

brain: 0

brain: 0

brain: 0

brain: 0

...

Any idea how to solve this problem? Thanks


Solution

  • You need to

    • split your data appropriately between the jobs AND
    • provide your worker code the information to correctly produce displayed brain indices.

    for i in X_train produces rows of X_train (along the first dimension), one at a time, and they have one dimension less than the initial array:

    In [7]: a=np.random.random((2,10))
    
    In [10]: a.shape
    Out[10]: (2, 10)
    
    In [11]: [i.shape for i in a]
    Out[11]: [(10,), (10,)]
    

    Since you didn't give all the sample code to reproduce the issue, I cannot say what shape your processing code expects.
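Going only by the reshape call shown in the question's transform, the effect of passing a single row can be illustrated as follows. This is a minimal sketch: the dimensions D, H, W are small stand-ins chosen for illustration (the question uses 176 x 208 x 176), and X here is made-up data. Because the leading dimension is -1, a single row becomes a batch of exactly one volume, so any index derived from data.shape[0] can only ever be 0.

```python
import numpy as np

# Small stand-in dimensions (assumption; the question uses 176 x 208 x 176).
D, H, W = 4, 5, 6

# Three made-up "brains", one flattened volume per row.
X = np.zeros((3, D * H * W))

# Reshaping the whole array yields a batch of three volumes.
full = np.reshape(X, (-1, D, H, W))

# Reshaping one row, as each parallel job receives it, yields a batch of one.
single = np.reshape(X[0], (-1, D, H, W))

print(full.shape)
print(single.shape)
```

With the leading dimension equal to 1, a loop over range(data.shape[0]) inside each job visits index 0 only.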


    Then, apparently, the number after "brain:" is the index of a row in the input. If you feed each job only a part of the input array, the jobs will naturally all produce the same indices. You need to tell each job its starting index so that it can calculate absolute indices correctly.
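One way to do both things is sketched below, under stated assumptions: process_chunk and process_in_parallel are hypothetical names, and the chunk * 2.0 line is only a placeholder for the real per-brain processing, which the question does not fully show. The input is split into blocks of rows with np.array_split, and each job receives its block together with the absolute index of the block's first row.

```python
import numpy as np
from joblib import Parallel, delayed


def process_chunk(chunk, start):
    """Process one block of rows; `start` is the absolute index of chunk[0]."""
    ids = np.arange(start, start + len(chunk))
    for brain_id in ids:
        print(f"brain: {brain_id}")
    # Placeholder for the real processing (hypothetical).
    return ids, chunk * 2.0


def process_in_parallel(X, n_jobs=2, n_chunks=4):
    """Split X along the first axis and process the blocks in parallel."""
    chunks = np.array_split(X, n_chunks)
    # Absolute starting row of each block: 0, len(c0), len(c0) + len(c1), ...
    starts = np.cumsum([0] + [len(c) for c in chunks[:-1]])
    results = Parallel(n_jobs=n_jobs)(
        delayed(process_chunk)(c, int(s)) for c, s in zip(chunks, starts)
    )
    # Parallel returns results in submission order, so a plain
    # concatenation restores the original row order.
    ids = np.concatenate([r[0] for r in results])
    data = np.concatenate([r[1] for r in results])
    return ids, data
```

Called as ids, data = process_in_parallel(X_train, n_jobs=num_cores), each worker would then report distinct absolute indices. Using a few large blocks instead of one job per row also reduces the overhead of sending data to the workers.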