Why is my numpy code with threading not parallel?

I need to perform some calculations on a raster (matrix) for several points neighbourhoods. My idea was to do these calculations in parallel threads and then sum up the resulting rasters. My problem is that the execution does not seem to be run in parallel. When I multiply the number of points by 2, I get 2 times longer execution. What am I doing wrong?

from threading import Lock, Thread
import numpy as np
import time

SIZE = 1000000
THREADS = 8
my_lock=Lock()
results = np.zeros(SIZE,dtype=np.float64)

def do_job(j):
    global results
    s_time = time.time()  
    print("Starting... "+str(j))

    #do some calculations
    c_r=np.zeros(SIZE,dtype=np.float64)
    for i in range(SIZE):
        c_r[i]=np.exp(-0.001*i)

    print("\t Calculation at job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))

    #sum up the results
    if my_lock.acquire(blocking=True):
        results = np.add(results,c_r)
        my_lock.release()

    print("\t Job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))



def main():
    global THREADS
    s_time = time.time()  
    threads=[]

    while THREADS>0:

        p = Thread(target=do_job,args=(THREADS,))
        threads.append(p)
        p.start()
        THREADS = THREADS-1

    print("Start finished after : {:3.3f}".format(time.time()-s_time))
    for p in threads:
        p.join()

    print("Total run diuration: {:3.3f}".format(time.time()-s_time))


if __name__ == "__main__":
    main()

when I run the code with THREADS=4 I get:

Starting... 4
Starting... 3
Starting... 2
Starting... 1
Start finished after : 0.069
         Calculation at job 4 lasted: 5.805
         Job 4 lasted: 5.887
         Calculation at job 3 lasted: 6.230
         Job 3 lasted: 6.237
         Calculation at job 1 lasted: 6.585
         Job 1 lasted: 6.595
         Calculation at job 2 lasted: 6.737
         Job 2 lasted: 6.738
Total run diuration: 6.760

When I switch to THREADS = 8 the execution time gets roughly doubled:

Starting... 8
Starting... 7
Starting... 6
Starting... 5
Starting... 4
Starting... 3
Starting... 1
Start finished after : 0.182
Starting... 2
         Calculation at job 7 lasted: 11.883
         Job 7 lasted: 11.939
         Calculation at job 8 lasted: 13.096
         Job 8 lasted: 13.144
         Calculation at job 1 lasted: 13.548
         Job 1 lasted: 13.576
         Calculation at job 3 lasted: 13.723
         Job 3 lasted: 13.748
         Calculation at job 2 lasted: 14.231
         Job 2 lasted: 14.268
         Calculation at job 5 lasted: 14.698
         Job 5 lasted: 14.708
         Calculation at job 4 lasted: 15.000
         Job 4 lasted: 15.015
         Calculation at job 6 lasted: 15.133
         Job 6 lasted: 15.135
Total run diuration: 15.136

Solution

Your are hit by Global Interpreter Lock (GIL) see https://wiki.python.org/moin/GlobalInterpreterLock.

Only one "thread" can enter the interpreter at the time. Your code mostly works inside for i in range(SIZE) loop which in executed by Python interpreter. The context switch can happen only on IO operation or when you call C-function (which releases GIL). Moreover the cost of switching between threads is large in comparison to operation executed by the thread. That is why adding more threads slows down execution.

According to numpy documentation, many operation releases GIL therefore you could gain advantage from threading if you vectorized your operation forcing program to spend more time inside numpy.

See post: Why are numpy calculations not affected by the global interpreter lock?

Try to modify from:

for i in range(SIZE):
        c_r[i]=np.exp(-0.001*i)

to:

c_r = np.exp(-0.001*np.arange(SIZE))