python-3.x pandas python-multiprocessing

How to use multiprocessing pool for Pandas apply function


I want to use a multiprocessing Pool with a Pandas DataFrame. I tried the code below, but it raises an error. Can't Pool be used with a Series?

from multiprocessing import Pool

split = np.array_split(split,4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()

The error message is as follows.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str

Solution

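  • np.array_split returns a plain Python list of chunks, so split['Test'] raises the TypeError above. pool.map also expects a callable as its first argument, not the result of calling apply. Extract the column first, split it, and pass the function itself to pool.map. Try: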

    import pandas as pd
    import numpy as np
    import multiprocessing as mp
    
    def test(x):
        # x is a whole Series chunk here, so the operation must be vectorized
        return x * 2
    
    if __name__ == '__main__':
        # Demo dataframe
        df = pd.DataFrame({'Test': range(100)})
    
        # Extract the Series and split it into chunks
        split = np.array_split(df['Test'], 4)
    
        # Parallel processing
        with mp.Pool(4) as pool:
            data = pool.map(test, split)
    
        # Concatenate results
        out = pd.concat(data)
    

    Output:

    >>> df
        Test
    0      0
    1      1
    2      2
    3      3
    4      4
    ..   ...
    95    95
    96    96
    97    97
    98    98
    99    99
    
    [100 rows x 1 columns]
    
    >>> out
    0       0
    1       2
    2       4
    3       6
    4       8
         ... 
    95    190
    96    192
    97    194
    98    196
    99    198
    Name: Test, Length: 100, dtype: int64