In order to understand how threads work in Python, I wrote the following simple function:
def sum_list(thelist:list, start:int, end:int):
s = 0
for i in range(start,end):
s += thelist[i]**3//10
return s
Then I created a list and tested how much time it takes to compute its sum:
LISTSIZE = 5000000
big_list = list(range(LISTSIZE))
start = time.perf_counter()
big_sum=sum_list(big_list, 0, LISTSIZE)
print(f"One thread: sum={big_sum}, time={time.perf_counter()-start} sec")
It took about 2 seconds.
Then I tried to partition the computation into threads, such that each thread computes the function on a subset of the list:
THREADCOUNT=4
SUBLISTSIZE = LISTSIZE//THREADCOUNT
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(THREADCOUNT) as executor:
futures = [executor.submit(sum_list, big_list, i*SUBLISTSIZE, (i+1)*SUBLISTSIZE) for i in range(THREADCOUNT)]
big_sum = 0
for res in concurrent.futures.as_completed(futures): # return each result as soon as it is completed:
big_sum += res.result()
print(f"{THREADCOUNT} threads: sum={big_sum}, time={time.perf_counter()-start} sec")
Since I have a 4-cores CPU, I expected it to run 4 times faster. But it did not: it ran in about 1.8 seconds on my Ubuntu machine (on my Windows machine, with 8 cores, it ran even slower than the single-thread version: about 2.2 seconds).
Is there a way to use ThreadPoolExecutor
(or another threads-based mechanism in Python) so that I can compute this function faster?
The problem is that the function you are trying to make faster is CPU-bound and the Python Global Interpreter Lock (GIL) prevents any performance gain from parallelisation of such code.
In Python, threads are wrapper around genuine OS thread. However, in order to avoid race conditions due to concurrent execution, only one thread can access the Python interpreter to execute bytecode at a time. This restriction is enforced by a lock called the GIL.
Thus in Python, true multithreading cannot be achieved and multiprocessing should be used instead. However, note that the GIL is not locked by IO operations (file reading, networking, etc.) and some library code (numpy, etc.) so these operations can still benefit from Python multithreading.
The function sum_list
used neither of those operations so it will not benefit from Python multithreading.
You can use ProcessPoolExecutor
to effectively get parallelism but this may copy the input list in your case. Multiprocessing is equivalent to launching multiple independent Python interpreters, thus the GIL's (one per intepreter) is not an issue anymore. However, multiprocessing incurs performance penalties during inter-process communication.