python, pandas, parallel-processing, multiprocessing, joblib

Shared-memory pandas data frame object in joblib.parallel


I'm using the Parallel function from joblib to parallelize a task. Every process takes the same pandas dataframe as input. Since all processes only read from it, is it possible to share this dataframe in order to reduce run-time memory use? I found a similar solution, but for a numpy array and using multiprocessing: Shared-memory objects in multiprocessing

This is the relevant snippet of the code:

from joblib import Parallel, delayed

def fun(df, cat):
    # 'y' is the label column of the dataframe
    a = df[df['y'] != cat]
    b = df[df['y'] == cat]
    ...

output = Parallel(n_jobs=-1)(delayed(fun)(df, cat) for cat in labels)

df is a pandas dataframe and labels is just a list.
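
For what it's worth, joblib can already memory-map large numpy inputs on its own: with the default backend, any argument array bigger than max_nbytes is dumped to disk once and opened read-only in every worker instead of being pickled per task. A minimal sketch of that approach, assuming the dataframe can be reduced to a single numeric numpy array (the column name 'y' and the example data are made up):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Made-up data: 'y' holds the category labels, as in the snippet above.
df = pd.DataFrame({'y': np.random.randint(0, 3, 1_000_000),
                   'x': np.random.rand(1_000_000)})
labels = df['y'].unique()

values = df.to_numpy()            # plain numeric array backing the frame
y_col = df.columns.get_loc('y')   # positional index of the label column

def fun(arr, col, cat):
    # arr arrives in each worker as a read-only memmap, not a copy
    a = arr[arr[:, col] != cat]
    b = arr[arr[:, col] == cat]
    return len(a), len(b)

# Arrays above max_nbytes (default '1M') are memory-mapped instead of pickled.
output = Parallel(n_jobs=-1, max_nbytes='1M')(
    delayed(fun)(values, y_col, cat) for cat in labels)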


Solution

  • I solved it by passing the filtered dataframes directly:

    output = Parallel(n_jobs=-1)(delayed(fun)(df[df[target] == cat],
                                              df[df[target] != cat],
                                              cat) for cat in labels)
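
    Note that this doesn't actually share memory: each filtered dataframe is still pickled and copied into its worker, it just avoids shipping the full frame to every task. If a truly shared, read-only frame is needed, one alternative (a sketch under assumptions, not something verified here) is to dump the frame once with joblib.dump and reload it in each worker with mmap_mode='r', so the numpy blocks behind the frame are memory-mapped rather than copied; whether pandas keeps the memmaps without copying depends on the dtypes and pandas version. The path below is made up:

    import joblib
    from joblib import Parallel, delayed

    joblib.dump(df, '/tmp/df.joblib')  # one-time dump of the full frame

    def fun(path, col, cat):
        df = joblib.load(path, mmap_mode='r')  # numpy blocks come back memmapped
        a = df[df[col] != cat]
        b = df[df[col] == cat]
        ...

    output = Parallel(n_jobs=-1)(
        delayed(fun)('/tmp/df.joblib', target, cat) for cat in labels)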