I want to parallelize a function over a pandas DataFrame in Python. I followed some tutorials, found example code, and adjusted it to my needs. When I execute the map call, the program freezes. The code looks solid to me, so I wonder what the problem is.
import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool
attributes1 = pd.read_csv('attributes1.csv')
def replace_data(data):
    for i in range(0, len(data.index)):
        temp = data.iloc[i, 1]
        temp = temp.replace('in.', 'inch')
        data.iloc[i, 1] = temp
    return data
num_partitions = 10  # number of partitions to split dataframe
num_cores = cpu_count()  # number of cores on your machine

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
df1 = parallelize_dataframe(attributes1, replace_data)
This problem affects Windows users only. First, I created a separate .py file, let's name it helpy.py, containing my replace_data function:
def replace_data(data):
    for i in range(0, len(data.index)):
        temp = data.iloc[i, 1]
        temp = temp.replace('in.', 'inch')
        data.iloc[i, 1] = temp
    return data
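As an aside, the row-by-row loop can be slow on large frames; pandas' vectorized string methods can do the same replacement in one call. A minimal sketch of an equivalent replace_data, assuming the second column holds strings:

import pandas as pd

def replace_data(data):
    # Replace 'in.' with 'inch' across the whole second column at once.
    # regex=False treats the pattern literally; otherwise the dot in
    # 'in.' would act as a regex wildcard and also match e.g. 'ing'.
    data.iloc[:, 1] = data.iloc[:, 1].str.replace('in.', 'inch', regex=False)
    return data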
Then I imported the function into my main .py file:
import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool
from helpy import replace_data

attributes1 = pd.read_csv('attributes1.csv')

num_partitions = 10  # number of partitions to split dataframe
num_cores = cpu_count()  # number of cores on your machine

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

if __name__ == '__main__':
    df1 = parallelize_dataframe(attributes1, replace_data)
I also added the if __name__ == '__main__': guard. On Windows, multiprocessing spawns fresh interpreter processes that re-import the main module; without the guard, each child re-executes the Pool creation at import time, spawning more children, which is why the original program froze. With the worker function in an importable module and the entry point guarded, the program runs smoothly.
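For completeness, a single-file variant also works on Windows, as long as everything that creates processes sits under the guard. A minimal sketch, assuming the same attributes1.csv layout and reusing the vectorized replace_data from the aside above (the print call is just illustrative):

import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool

num_partitions = 10  # number of partitions to split the dataframe

def replace_data(data):
    # Same replacement as before, vectorized over the second column.
    data.iloc[:, 1] = data.iloc[:, 1].str.replace('in.', 'inch', regex=False)
    return data

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    # map() blocks until all partitions are processed; the context
    # manager then tears the pool down on exit.
    with Pool(cpu_count()) as pool:
        return pd.concat(pool.map(func, df_split))

if __name__ == '__main__':
    # Everything that spawns processes (and the data load) sits under
    # the guard, so children re-importing this module do no extra work.
    attributes1 = pd.read_csv('attributes1.csv')
    df1 = parallelize_dataframe(attributes1, replace_data)
    print(df1.head())

The key point is the same either way: on spawn-based platforms the module must be importable without side effects that create new processes.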