Search code examples
pythonpandasparallel-processingpandarallel

Call parallel_apply for batch of rows


I need to apply a function on df, I used a pandarallel to parallelize the process, however, I have an issue here, I need to give func_do an N rows each call so that I can utilize a vectorization on that function.

The following will call func_do on each row. Any idea how to make a single call for each batch and keep the parallelization process.

def fun_do(value_col):
    return do(value_col)
df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)

Solution

  • A possible solution is to create virtual groups of N rows:

    import pandas as pd
    from pandarallel import pandarallel
    
    # Setup MRE
    pandarallel.initialize(progress_bar=False)
    df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})
    
    def fun_do(sr):
        return sr**2
    N = 4  # size of chunk
    df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
                    .parallel_apply(lambda x: fun_do(x['col1']))
                    .droplevel(0))  # <- remove virtual group index
    

    Output:

    >>> df
         col1     col2
    0     0.0      0.0
    1    10.0    100.0
    2    20.0    400.0
    3    30.0    900.0
    4    40.0   1600.0
    5    50.0   2500.0
    6    60.0   3600.0
    7    70.0   4900.0
    8    80.0   6400.0
    9    90.0   8100.0
    10  100.0  10000.0
    

    Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available with SeriesGroupBy.
    This is the first time I use pandarallel, usually I used multiprocessing module