I'm new to multiprocessing in Python. I have a task that takes approximately 10 minutes to run and needs to be run multiple times with different parameters, so multiprocessing seems like a good way to reduce the total run time.
My code is a simple test that does not run as I expect, so obviously I'm doing something wrong. Nothing gets printed to the console, and the list that is returned contains Process objects rather than dataframes.
I need to pass a dictionary to my function, which in turn will return a dataframe. How do I do this?
import time
import pandas as pd
import multiprocessing as mp

def multiproc():
    processes = []
    settings = {1: {'sleep_time': 5,
                    'id': 1},
                2: {'sleep_time': 1,
                    'id': 2},
                3: {'sleep_time': 2,
                    'id': 3},
                4: {'sleep_time': 3,
                    'id': 4}}
    for key in settings:
        p = mp.Process(target=calc_something, args=(settings[key],))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    return processes

def calc_something(settings: dict) -> pd.DataFrame:
    time_to_sleep = settings['sleep_time']
    time.sleep(time_to_sleep)
    print(str(settings['id']))
    df = some_function_creates_data_frame()
    return df
Despite your indentation errors, I will risk taking a guess on your intentions.
A process pool is indicated when you are submitting multiple tasks and either want to limit the number of processes working on these tasks or need to get return values back from the tasks. With mp.Process, the return value of the target function is simply discarded, which is why your list contains Process objects rather than dataframes. There are other ways to return a value from a process, such as using a queue, but a process pool makes this easy.
import time
import pandas as pd
import multiprocessing as mp

def calc_something(settings: dict) -> pd.DataFrame:
    time_to_sleep = settings['sleep_time']
    time.sleep(time_to_sleep)
    print(str(settings['id']))
    df = pd.DataFrame({'sleep_time': [time_to_sleep], 'id': [settings['id']]})
    return df

def multiproc():
    settings = {1: {'sleep_time': 5,
                    'id': 1},
                2: {'sleep_time': 1,
                    'id': 2},
                3: {'sleep_time': 2,
                    'id': 3},
                4: {'sleep_time': 3,
                    'id': 4}}
    with mp.Pool() as pool:
        data_frames = pool.map(calc_something, settings.values())
    return data_frames

if __name__ == '__main__': # required for Windows
    data_frames = multiproc()
    for data_frame in data_frames:
        print(data_frame)
Prints:
2
3
4
1
   sleep_time  id
0           5   1
   sleep_time  id
0           1   2
   sleep_time  id
0           2   3
   sleep_time  id
0           3   4
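For comparison, here is a rough sketch of the queue-based alternative mentioned earlier. The names worker and run_with_queue are hypothetical, and a plain dict stands in for the dataframe so the sketch has no dependencies beyond the standard library; treat it as one possible arrangement rather than the definitive way to do it.

```python
import multiprocessing as mp
import time


def worker(settings: dict, queue) -> None:
    """Simulate work, then put the result on the queue instead of returning it."""
    time.sleep(settings['sleep_time'])
    # Assumption: a plain dict stands in for the DataFrame in this sketch.
    queue.put({'id': settings['id'], 'sleep_time': settings['sleep_time']})


def run_with_queue(all_settings: list) -> list:
    """Start one process per settings dict and collect one result from each."""
    queue = mp.Queue()
    processes = [mp.Process(target=worker, args=(s, queue)) for s in all_settings]
    for p in processes:
        p.start()
    # Read the results before joining: a process that still has data
    # waiting to be flushed to the queue may not exit until it is read.
    results = [queue.get() for _ in processes]
    for p in processes:
        p.join()
    # Results arrive in completion order, so sort them for a stable order.
    return sorted(results, key=lambda r: r['id'])


if __name__ == '__main__':
    demo = [{'sleep_time': 0.2, 'id': 1}, {'sleep_time': 0.1, 'id': 2}]
    print(run_with_queue(demo))
```

Unlike pool.map, this starts one process per task with no upper limit, and the ordering of results has to be handled by hand, which is part of why the pool is the more convenient choice here.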
Important Note
When creating processes under Windows, or any platform that does not use fork to create new processes, the code that creates these processes must be invoked within an if __name__ == '__main__': block. Otherwise each new process re-executes that code when it imports the main module, which leads to a recursive attempt to spawn new processes. This may have been part of your problem, but it is hard to tell since, in addition to your indentation problem, you did not post a minimal, reproducible example.
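If you are unsure which case applies to you, the following small sketch prints the start method in use and shows the guard in its simplest form. The functions square and main are made up for illustration; mp.get_start_method() is part of the standard library, and its result depends on your platform and Python version.

```python
import multiprocessing as mp


def square(x: int) -> int:
    """Trivial task; it must live at module level so worker processes can find it."""
    return x * x


def main() -> list:
    # Typically 'fork', 'spawn', or 'forkserver'. With 'spawn', every worker
    # re-imports this module, so unguarded top-level code would run again.
    print('start method:', mp.get_start_method())
    with mp.Pool(processes=2) as pool:  # cap the pool at two worker processes
        return pool.map(square, [1, 2, 3])


if __name__ == '__main__':
    # Only the original process reaches this point; workers that re-import
    # the module see a different __name__ and skip it.
    print(main())
```

Keeping everything that starts processes inside main() and calling it only from the guarded block makes the script behave the same regardless of the start method.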