I need to perform some calculations on a raster (matrix) for the neighbourhoods of several points. My idea was to do these calculations in parallel threads and then sum up the resulting rasters. My problem is that the execution does not seem to run in parallel: when I multiply the number of points by 2, the execution takes twice as long. What am I doing wrong?
from threading import Lock, Thread
import numpy as np
import time

SIZE = 1000000
THREADS = 8

my_lock=Lock()
results = np.zeros(SIZE,dtype=np.float64)

def do_job(j):
    global results
    s_time = time.time()
    print("Starting... "+str(j))
    #do some calculations
    c_r=np.zeros(SIZE,dtype=np.float64)
    for i in range(SIZE):
        c_r[i]=np.exp(-0.001*i)
    print("\t Calculation at job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))
    #sum up the results
    if my_lock.acquire(blocking=True):
        results = np.add(results,c_r)
        my_lock.release()
    print("\t Job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))

def main():
    global THREADS
    s_time = time.time()
    threads=[]
    while THREADS>0:
        p = Thread(target=do_job,args=(THREADS,))
        threads.append(p)
        p.start()
        THREADS = THREADS-1
    print("Start finished after : {:3.3f}".format(time.time()-s_time))
    for p in threads:
        p.join()
    print("Total run diuration: {:3.3f}".format(time.time()-s_time))

if __name__ == "__main__":
    main()
When I run the code with THREADS=4, I get:
Starting... 4
Starting... 3
Starting... 2
Starting... 1
Start finished after : 0.069
Calculation at job 4 lasted: 5.805
Job 4 lasted: 5.887
Calculation at job 3 lasted: 6.230
Job 3 lasted: 6.237
Calculation at job 1 lasted: 6.585
Job 1 lasted: 6.595
Calculation at job 2 lasted: 6.737
Job 2 lasted: 6.738
Total run diuration: 6.760
When I switch to THREADS = 8, the execution time roughly doubles:
Starting... 8
Starting... 7
Starting... 6
Starting... 5
Starting... 4
Starting... 3
Starting... 1
Start finished after : 0.182
Starting... 2
Calculation at job 7 lasted: 11.883
Job 7 lasted: 11.939
Calculation at job 8 lasted: 13.096
Job 8 lasted: 13.144
Calculation at job 1 lasted: 13.548
Job 1 lasted: 13.576
Calculation at job 3 lasted: 13.723
Job 3 lasted: 13.748
Calculation at job 2 lasted: 14.231
Job 2 lasted: 14.268
Calculation at job 5 lasted: 14.698
Job 5 lasted: 14.708
Calculation at job 4 lasted: 15.000
Job 4 lasted: 15.015
Calculation at job 6 lasted: 15.133
Job 6 lasted: 15.135
Total run diuration: 15.136
You are being hit by the Global Interpreter Lock (GIL); see https://wiki.python.org/moin/GlobalInterpreterLock.
Only one thread can execute Python bytecode at a time.
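Your code spends most of its time inside the for i in range(SIZE) loop, which is executed by the Python interpreter. A thread gives up the GIL only when it blocks on I/O, when it calls a C function that releases the GIL, or when the interpreter forces a periodic switch, so the loop iterations of different threads never run at the same time. Moreover, the cost of switching between threads is large compared with the work done in each iteration. That is why adding more threads does not speed up the execution and can even slow it down.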
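To see this effect in isolation, here is a minimal sketch that runs the same pure-Python loop first sequentially and then in threads. The names count_up, run_sequential and run_threaded, and the workload size N, are made up for this illustration. On a standard (GIL-enabled) CPython build I would expect the two timings to be in the same ballpark, but the exact numbers depend on your machine and Python version.

```python
import time
from threading import Thread

N = 200_000  # hypothetical workload size, chosen only to keep the demo short


def count_up(n, out, idx):
    """Pure-Python CPU-bound loop; it needs the GIL for every iteration."""
    total = 0
    for i in range(n):
        total += i
    out[idx] = total


def run_sequential(jobs):
    """Run the jobs one after another in the calling thread."""
    out = [None] * jobs
    start = time.perf_counter()
    for j in range(jobs):
        count_up(N, out, j)
    return out, time.perf_counter() - start


def run_threaded(jobs):
    """Run the same jobs, one thread per job."""
    out = [None] * jobs
    threads = [Thread(target=count_up, args=(N, out, j)) for j in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = run_sequential(4)
    _, thr_time = run_threaded(4)
    print("sequential: {:.3f} s".format(seq_time))
    print("threaded:   {:.3f} s".format(thr_time))
```

Both versions compute the same results; only the scheduling differs.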
According to the NumPy documentation, many operations release the GIL, so you could benefit from threading if you vectorized your calculation, forcing the program to spend more time inside NumPy.
See post: Why are numpy calculations not affected by the global interpreter lock?
Try changing:
for i in range(SIZE):
    c_r[i]=np.exp(-0.001*i)
to:
c_r = np.exp(-0.001*np.arange(SIZE))
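Put together, the threaded job could look roughly like the sketch below. It is not a drop-in replacement for your program: run_jobs is a name I made up, the timing prints are left out, and the per-job calculation is still the placeholder exp from the question. I also used "with lock:" and an in-place np.add(..., out=results) instead of rebinding a global, which is a style choice rather than something the fix requires.

```python
from threading import Lock, Thread

import numpy as np

SIZE = 1000000


def run_jobs(n_threads, size=SIZE):
    """Sum the per-job rasters computed by n_threads threads."""
    results = np.zeros(size, dtype=np.float64)
    lock = Lock()

    def do_job(j):
        # Vectorized: the whole array is computed inside NumPy's C code
        # instead of one element per Python loop iteration.
        c_r = np.exp(-0.001 * np.arange(size))
        # Sum up the results; the lock keeps the accumulation serialized.
        with lock:
            np.add(results, c_r, out=results)

    threads = [Thread(target=do_job, args=(j,)) for j in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    total = run_jobs(4)
    print(total[:3])
```

Most of the gain here comes from the vectorization itself. How much the threads add on top of that depends on how long each NumPy call runs and on whether it releases the GIL, so measure it with your real calculation.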