Tags: python, pandas, parallel-processing, bodo

Parallelize apply after pandas groupby


I have used rosetta.parallel.pandas_easy to parallelize apply after groupby, for example:

import numpy as np
import pandas as pd
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame
df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)

However, has anyone figured out how to parallelize a function that returns a DataFrame? The code below fails with rosetta, as expected:

def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

# Serial version with plain pandas
df.groupby(df.index).apply(tmpFunc)
# rosetta version (this is the call that fails)
groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)

Solution

  • This seems to work, although it really should be built into pandas:

    import pandas as pd
    from joblib import Parallel, delayed
    import multiprocessing
    
    def tmpFunc(df):
        df['c'] = df.a + df.b
        return df
    
    def applyParallel(dfGrouped, func):
        retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
        return pd.concat(retLst)
    
    if __name__ == '__main__':
        df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
        print('parallel version: ')
        print(applyParallel(df.groupby(df.index), tmpFunc))
    
        print('regular version: ')
        print(df.groupby(df.index).apply(tmpFunc))
    
        # Ideal version (does not work: GroupBy has no applyParallel method):
        # print(df.groupby(df.index).applyParallel(tmpFunc))
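
A possible variant of the joblib approach, offered as a sketch rather than a definitive implementation. The names `add_c` and `apply_parallel` and the `n_jobs` argument are hypothetical additions for illustration, not part of the original answer. The per-group function returns a new frame via `assign` instead of mutating the group it receives, which may be safer when groups are handed to worker processes.

```python
import pandas as pd
from joblib import Parallel, delayed


def add_c(group):
    # Return a new frame with column 'c'; the input group is left unchanged.
    return group.assign(c=group["a"] + group["b"])


def apply_parallel(grouped, func, n_jobs=-1):
    # Iterating a GroupBy yields (name, sub-frame) pairs; only the sub-frame is used.
    # n_jobs=-1 asks joblib to use all available CPUs.
    results = Parallel(n_jobs=n_jobs)(delayed(func)(group) for _, group in grouped)
    # Stitch the per-group frames back into a single DataFrame.
    return pd.concat(results)


if __name__ == "__main__":
    df = pd.DataFrame({"a": [6, 2, 2], "b": [4, 5, 6]}, index=["g1", "g1", "g2"])
    print(apply_parallel(df.groupby(df.index), add_c, n_jobs=2))
```

For frames this small, the cost of starting workers and pickling each group will likely outweigh any speedup; the pattern is only worth it when the per-group work is expensive.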