
Thread Pool Executor using Concurrent: no improvement for various number of workers


I'm trying to run a task in parallel using concurrent.futures. Below is the code I'm using:

import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures

# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ",cpu_num)

# max_Worker = cpu_num
max_Worker = 1

# A fake input array
n=1000000
array = list(range(n))
results = []

# A fake function being applied to each element of array 
def task(i):
  return i**2 

x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
  features = {executor.submit(task, j) for j in array}

  # the real function is heavy and we need to be sure of completeness of each run
  for future in concurrent.futures.as_completed(features):
    results.append(future.result())
      
results = [future.result() for future in features]
y = time.time()

print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')

And now my questions,

  1. Although there is no error, is the code correct and optimized?
  2. When I vary the number of workers, there seems to be no improvement in speed (e.g., 1 vs. 16 workers makes no difference). What is the problem, and how can it be solved?

Thanks in advance,


Solution

  • See my comment to your question. In addition to the overhead I mentioned in that comment, you also need to account for the overhead of just creating the process pool itself.

    The following is a benchmark with several results. The first is a timing of just calling the worker function task 100000 times, building a results list, and printing the last element of that list. It will become apparent why I have reduced the number of calls to task from 1000000 to 100000.

    The next attempt is to use multiprocessing to accomplish the same thing using a ProcessPoolExecutor with the submit method and then processing the Future instances that are returned.

    The next attempt instead uses the map method with its default chunksize argument of 1. It is important to understand this argument. With a chunksize value of 1, each element of the iterable passed to map is written individually to a queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle, it pulls the next chunk of tasks from the queue, processes each task in the chunk, and then becomes idle again. When many tasks are submitted via map, a chunksize value of 1 is inefficient. You would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
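
To make the chunking described above concrete, here is a minimal sketch of how an iterable can be grouped into chunks of a given size. The helper name split_into_chunks is hypothetical and only illustrates the idea; it is not meant as the library's actual internal code.

```python
from itertools import islice


def split_into_chunks(iterable, chunksize):
    """Yield successive tuples of at most `chunksize` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


# With chunksize=1 every element becomes its own queue entry.
print(list(split_into_chunks(range(5), 1)))  # [(0,), (1,), (2,), (3,), (4,)]

# With chunksize=3 the same five elements travel in only two entries.
print(list(split_into_chunks(range(5), 3)))  # [(0, 1, 2), (3, 4)]
```

Each tuple corresponds to one item placed on the task queue, so a larger chunksize means fewer queue operations for the same amount of work.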

    The next attempt specifies a chunksize value that approximates the value the map function of the Pool class in the multiprocessing package would have used by default. As you can see, the improvement over a chunksize of 1 is dramatic, but it is still slower than the non-multiprocessing case.
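
As a rough illustration of why the chunksize matters so much here, the following sketch counts how many chunks end up on the task queue for the benchmark's 100000 elements. The helper count_chunks is hypothetical, and the value of 8 CPUs is an assumption chosen to match the machine used for the benchmark output.

```python
import math


def count_chunks(n_items, chunksize):
    """Number of chunks needed to cover n_items (hypothetical helper)."""
    return math.ceil(n_items / chunksize)


n = 100_000
cpu_num = 8  # assumption: matches the 8 CPUs reported by the benchmark

# Same formula the benchmark uses for its explicit chunksize.
chunksize = n // (4 * cpu_num)
print(chunksize)                   # 3125

print(count_chunks(n, 1))          # 100000
print(count_chunks(n, chunksize))  # 32
```

With a chunksize of 1 the pool has to pass 100000 separate items between processes, whereas with a chunksize of 3125 it passes only 32.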

    The final attempt uses the multiprocessing facility provided by the multiprocessing package and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function computes a more intelligent default chunksize when no chunksize argument is specified.

    import os
    import time
    from concurrent.futures import ProcessPoolExecutor as PE
    from multiprocessing import Pool
    
    # A fake function being applied to each element of array
    def task(i):
      return i**2
    
    # required for Windows:
    if __name__ == '__main__':
        n=100000
    
        t1 = time.time()
        results = [task(i) for i in range(n)]
        print('Non-multiprocessing time:', time.time() - t1, results[-1])
    
        # num CPUs
        cpu_num = os.cpu_count()
        print("Number of CPUs available: ",cpu_num)
    
        t1 = time.time()
        with PE(max_workers=cpu_num) as executor:
            futures = [executor.submit(task, i) for i in range(n)]
            results = [future.result() for future in futures]
        print('Multiprocessing time using submit:', time.time() - t1,  results[-1])
    
        t1 = time.time()
        with PE(max_workers=cpu_num) as executor:
            results = list(executor.map(task, range(n)))
        print('Multiprocessing time using map:', time.time() - t1, results[-1])
    
        t1 = time.time()
        chunksize = n // (4 * cpu_num)
        with PE(max_workers=cpu_num) as executor:
            results = list(executor.map(task, range(n), chunksize=chunksize))
        print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])
    
        t1 = time.time()
        with Pool(cpu_num) as executor:
            results = executor.map(task, range(n))
        print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
    

    Prints:

    Non-multiprocessing time: 0.027019739151000977 9999800001
    Number of CPUs available:  8
    Multiprocessing time using submit: 77.34723353385925 9999800001
    Multiprocessing time using map: 79.52981925010681 9999800001
    Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
    Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
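
If you want to keep the submit/as_completed pattern from the question (for example, to be sure every run completed), the same chunking idea can be applied by hand. The following is only a sketch under assumptions: apply_to_batch and run_in_batches are hypothetical helper names, and the batch size is left to the caller.

```python
from concurrent.futures import as_completed
from itertools import islice


def task(i):
    return i ** 2


def apply_to_batch(func, batch):
    # Runs inside a worker: one submitted job handles a whole batch of items.
    return [func(item) for item in batch]


def run_in_batches(executor, func, items, batch_size):
    """Submit `items` in batches and return the results in input order.

    `executor` can be any concurrent.futures executor (hypothetical helper).
    """
    it = iter(items)
    futures = {}  # future -> position of its batch
    position = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        futures[executor.submit(apply_to_batch, func, batch)] = position
        position += 1

    parts = [None] * position
    # as_completed yields futures as they finish; result() re-raises
    # any exception that occurred in the worker.
    for future in as_completed(futures):
        parts[futures[future]] = future.result()
    return [value for part in parts for value in part]
```

It could be called as run_in_batches(executor, task, range(n), batch_size=chunksize) inside a with block that creates the executor. With a ProcessPoolExecutor, func must be a picklable, module-level function such as task, and on Windows the call belongs under the if __name__ == '__main__': guard.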
    

    Update

    The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. It would also seem that for this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (by default it would compute a chunksize value of 4) is slightly more performant.

    import os
    import time
    from concurrent.futures import ProcessPoolExecutor as PE
    from multiprocessing import Pool
    
    # A fake function being applied to each element of array
    def task(i):
        for _ in range(1_000_000):
            result = i ** 2
        return result
    
    def compute_chunksize(iterable_size, pool_size):
        chunksize, remainder = divmod(iterable_size, pool_size * 4)
        if remainder:
            chunksize += 1
        return chunksize
    
    # required for Windows:
    if __name__ == '__main__':
        n = 100
        cpu_num = os.cpu_count()
        chunksize = compute_chunksize(n, cpu_num)
    
        t1 = time.time()
        results = [task(i) for i in range(n)]
        t2 = time.time()
        print('Non-multiprocessing time:', t2 - t1, results[-1])
    
        # num CPUs
        print("Number of CPUs available: ",cpu_num)
    
        t1 = time.time()
        with PE(max_workers=cpu_num) as executor:
            futures = [executor.submit(task, i) for i in range(n)]
            results = [future.result() for future in futures]
            t2 = time.time()
        print('Multiprocessing time using submit:', t2 - t1,  results[-1])
    
        t1 = time.time()
        with PE(max_workers=cpu_num) as executor:
            results = list(executor.map(task, range(n)))
            t2 = time.time()
        print('Multiprocessing time using map:', t2 - t1, results[-1])
    
        t1 = time.time()
    
        with PE(max_workers=cpu_num) as executor:
            results = list(executor.map(task, range(n), chunksize=chunksize))
            t2 = time.time()
        print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])
    
        t1 = time.time()
        with Pool(cpu_num) as executor:
            results = executor.map(task, range(n))
            t2 = time.time()
        print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])
    
        t1 = time.time()
        with Pool(cpu_num) as executor:
            results = executor.map(task, range(n), chunksize=1)
            t2 = time.time()
        print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
    

    Prints:

    Non-multiprocessing time: 23.12758779525757 9801
    Number of CPUs available:  8
    Multiprocessing time using submit: 5.336004018783569 9801
    Multiprocessing time using map: 5.364996671676636 9801
    Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
    Multiprocessing time using Pool.map: 5.400001287460327 9801
    Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801
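
To put these timings in perspective, the sketch below computes the speedup over the non-multiprocessing run from the numbers printed above. It also estimates how many tasks the busiest worker handles for each chunksize, which is one possible explanation (an assumption, not something the benchmark proves) for why chunksize=1 comes out slightly ahead. The helper names are hypothetical, and the estimate assumes all tasks take equal time and chunks are spread evenly over the workers.

```python
import math

# Timings in seconds, copied from the benchmark output above.
serial = 23.12758779525757
submit = 5.336004018783569
pool_map_chunk1 = 4.698001146316528
cpu_num = 8


def speedup(parallel_time):
    """How many times faster a run is than the non-multiprocessing run."""
    return serial / parallel_time


print(round(speedup(submit), 2))           # 4.33
print(round(speedup(pool_map_chunk1), 2))  # 4.92


def busiest_worker_tasks(n_tasks, chunksize, workers):
    """Tasks handled by the most loaded worker (simplified estimate)."""
    n_chunks = math.ceil(n_tasks / chunksize)
    return math.ceil(n_chunks / workers) * chunksize


print(busiest_worker_tasks(100, 4, cpu_num))  # 16
print(busiest_worker_tasks(100, 1, cpu_num))  # 13
```

Under these assumptions, a chunksize of 4 leaves one worker with 16 tasks while a chunksize of 1 caps the busiest worker at 13, which is consistent with the small difference in the measured times.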