python pandas parallel-processing pandarallel

Call parallel_apply for batch of rows

I need to apply a function on df, I used a pandarallel to parallelize the process, however, I have an issue here, I need to give func_do an N rows each call so that I can utilize a vectorization on that function.

The following will call func_do on each row. Any idea how to make a single call for each batch and keep the parallelization process.

def fun_do(value_col):
    return do(value_col)
df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)

Solution

A possible solution is to create virtual groups of N rows:

import pandas as pd
from pandarallel import pandarallel

# Setup MRE
pandarallel.initialize(progress_bar=False)
df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})

def fun_do(sr):
    return sr**2
N = 4  # size of chunk
df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
                .parallel_apply(lambda x: fun_do(x['col1']))
                .droplevel(0))  # <- remove virtual group index

Output:

>>> df
     col1     col2
0     0.0      0.0
1    10.0    100.0
2    20.0    400.0
3    30.0    900.0
4    40.0   1600.0
5    50.0   2500.0
6    60.0   3600.0
7    70.0   4900.0
8    80.0   6400.0
9    90.0   8100.0
10  100.0  10000.0

Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available with SeriesGroupBy.
This is the first time I use pandarallel, usually I used multiprocessing module