Multiprocessing in Python using Pool


I want to parallelize a function over a DataFrame in Python. I followed some tutorials, found some code, and adjusted it to my needs. When I execute the map function, the program freezes. The code looks correct to me, so I wonder what the problem is.

import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool

attributes1 = pd.read_csv('attributes1.csv')


def replace_data(data):
    for i in range(0, len(data.index)):
        temp = data.iloc[i, 1]
        temp = temp.replace('in.', 'inch')
        data.iloc[i, 1] = temp
    return data

num_partitions = 10 #number of partitions to split dataframe
num_cores = cpu_count() #number of cores on your machine

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

df1 = parallelize_dataframe(attributes1, replace_data)

Solution

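  • This problem mainly affects Windows users (and any setup that uses the spawn start method), because each worker process re-imports the main script when it starts. First, I created another .py file, let's call it helpy.py, and put my replace_data function in it: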

    def replace_data(data):
        for i in range(0, len(data.index)):
            temp = data.iloc[i, 1]
            temp = temp.replace('in.', 'inch')
            data.iloc[i, 1] = temp
        return data
    

    Then I imported my function into my main .py file.

    import pandas as pd
    import numpy as np
    from multiprocessing import cpu_count, Pool
    from helpy import replace_data
    
    attributes1 = pd.read_csv('attributes1.csv')
    
    
    num_partitions = 10 #number of partitions to split dataframe
    num_cores = cpu_count() #number of cores on your machine
    
    def parallelize_dataframe(df, func):
        df_split = np.array_split(df, num_partitions)
        pool = Pool(num_cores)
        df = pd.concat(pool.map(func, df_split))
        pool.close()
        pool.join()
        return df
    
    if __name__ == '__main__':
        df1 = parallelize_dataframe(attributes1, replace_data)
    

    I also wrapped the call in an if __name__ == '__main__': guard, so the worker processes do not run it again when they import the main file. Now the program runs smoothly.
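
The same pattern can be sketched without pandas, which makes it easier to try in isolation. This is only an illustration under assumptions, not the code from the answer: it works on a plain list of strings instead of a DataFrame column, and the names replace_units, split_into_chunks and parallel_replace are hypothetical. The two points it is meant to show are that the worker function sits at module top level (so child processes can find it by name) and that the pool is only started under the if __name__ == '__main__': guard.

```python
from multiprocessing import Pool, cpu_count


def replace_units(rows):
    """Worker (hypothetical name): defined at module top level so child processes can import it."""
    return [text.replace('in.', 'inch') for text in rows]


def split_into_chunks(items, num_chunks):
    """Split a list into num_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks get one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_replace(rows, num_chunks=4):
    """Run replace_units on each chunk in a worker pool and rejoin the results in order."""
    chunks = split_into_chunks(rows, num_chunks)
    # map() blocks until all chunks are done; leaving the with-block shuts the pool down.
    with Pool(min(num_chunks, cpu_count())) as pool:
        results = pool.map(replace_units, chunks)
    return [text for chunk in results for text in chunk]


if __name__ == '__main__':
    # Only the parent process runs this block; a spawned child that
    # re-imports this module skips it, so no new pools are created there.
    sample = ['12 in. pipe', '3 in. bolt', 'no unit', '7 in.']
    print(parallel_replace(sample))
```

Without the guard, a platform that spawns its workers would execute the pool-creating call again in every child during the re-import, which is the kind of failure described in the question.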