
Parallel processing with ProcessPoolExecutor


I have a huge list of elements that must be processed somehow. I know that it can be done with Process from multiprocessing like this:

from multiprocessing import Process
pr1 = Process(target=calculation_function, args=(args,))
pr1.start()
pr1.join()

and so I can create, let's say, 10 processes and pass each of them a tenth of the list as args. And then the job is done.
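
For reference, the manual approach described above might look roughly like the sketch below. It is only an illustration: split_evenly is a hypothetical helper, the sample data is made up, and calculation_function here just squares each element as a stand-in for the real job.

```python
from multiprocessing import Process


def split_evenly(items, parts):
    """Split items into `parts` contiguous slices whose lengths differ by at most one."""
    size, extra = divmod(len(items), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def calculation_function(chunk):
    # Placeholder for the real work (assumption): square every element.
    for element in chunk:
        _ = element ** 2


if __name__ == "__main__":
    data = list(range(100))  # made-up sample data
    workers = [Process(target=calculation_function, args=(chunk,))
               for chunk in split_evenly(data, 10)]
    for worker in workers:
        worker.start()  # start every process first so they run concurrently
    for worker in workers:
        worker.join()   # then wait for all of them to finish
```

Note that all processes are started before any of them is joined; calling start() and join() back to back on each process would run them one after another.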

But I do not want to create and manage the processes manually. Instead I want to use ProcessPoolExecutor, and I am doing it like this:

executor = ProcessPoolExecutor(max_workers=10)
executor.map(calculation, (list_to_process,))

calculation is my function which does the job.

def calculation(list_to_process):
    for element in list_to_process:
        # .... doing the job

list_to_process is my list to be processed.

But when I run this code, the loop is executed just one time. I thought that

executor = ProcessPoolExecutor(max_workers=10)
executor.map(calculation, (list_to_process,))

is the same as this 10 times:

pr1 = Process(target=calculation, args=(list_to_process,))
pr1.start()
pr1.join()

But it seems to be wrong.

How can I achieve real multiprocessing with ProcessPoolExecutor?


Solution

  • Remove the for loop from your calculation function. Now that you're using ProcessPoolExecutor.map, that map() call is your loop, the difference being that each element in the list is submitted as a separate task and the tasks are spread across the worker processes. E.g.

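    import os
    import time
    from concurrent.futures import ProcessPoolExecutor

    def calculation(item):
        print('[pid:%s] performing calculation on %s' % (os.getpid(), item))
        time.sleep(5)
        print('[pid:%s] done!' % os.getpid())
        return item ** 2

    if __name__ == '__main__':
        # The guard matters on platforms that start workers by spawning (e.g. Windows).
        executor = ProcessPoolExecutor(max_workers=5)
        list_to_process = range(10)
        result = executor.map(calculation, list_to_process)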
    

    You'll see something in the terminal like:

    [pid:23988] performing calculation on 0
    [pid:10360] performing calculation on 1
    [pid:13348] performing calculation on 2
    [pid:24032] performing calculation on 3
    [pid:18028] performing calculation on 4
    [pid:23988] done!
    [pid:23988] performing calculation on 5
    [pid:10360] done!
    [pid:13348] done!
    [pid:10360] performing calculation on 6
    [pid:13348] performing calculation on 7
    [pid:18028] done!
    [pid:24032] done!
    [pid:18028] performing calculation on 8
    [pid:24032] performing calculation on 9
    [pid:23988] done!
    [pid:10360] done!
    [pid:13348] done!
    [pid:18028] done!
    [pid:24032] done!
    

    The exact order of events will vary from run to run, though. In my Python version the return value of map() is actually an itertools.chain object, but that's an implementation detail: all you can rely on is that it is an iterator which yields the results in the same order as the inputs. You can collect the results into a list like:

    >>> list(result)
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    

    In your example code you've instead passed a single-element tuple (list_to_process,), so map() sees an iterable containing exactly one item, the full list, and calls calculation just once with it, in a single process.
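
If you would rather keep the for loop and hand each worker a slice of the list, which is closer to the manual Process approach from the question, one option is to map over chunks instead of single elements. The following is a minimal sketch under stated assumptions: chunked is a hypothetical helper, the chunk size of 2 is arbitrary, and squaring stands in for the real job.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Return consecutive slices of `items`, each holding at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def calculation(chunk):
    # Stand-in for the real job (assumption): the loop stays,
    # but it only runs over the one chunk this call received.
    return [element ** 2 for element in chunk]


if __name__ == "__main__":
    list_to_process = list(range(20))  # made-up sample data
    with ProcessPoolExecutor(max_workers=10) as executor:
        # One call per chunk: 20 elements in chunks of 2 give 10 calls.
        per_chunk = list(executor.map(calculation, chunked(list_to_process, 2)))
    # Flatten the per-chunk results back into a single list.
    flat = [value for chunk in per_chunk for value in chunk]
    print(flat)
```

For a really huge list this also cuts down on inter-process overhead, since arguments and results are pickled once per chunk rather than once per element. If you keep the per-item version of calculation, the chunksize argument of ProcessPoolExecutor.map (available since Python 3.5) batches the items for you in a similar way.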