I wrote a function with Numba to compute the HOG (histogram of oriented gradients) of an image, and I ran it on 7000 images. It takes about 10 seconds. But when I commented out the line that accumulates into the array (`hist[idx] += mag`), the time dropped to about 5 milliseconds. What is the problem, and what should I do about it?
```python
import math
import time

import numba
import numpy as np


@numba.jit(numba.uint64[:](numba.uint8[:,:], numba.uint8), nopython=True)
def hog_numba(img, bins):
    h, w = img.shape
    hist = np.zeros(bins, dtype=np.uint64)
    for i in range(1, h - 1):      # start at 1 so img[i-1, ...] stays in bounds
        for j in range(1, w - 1):
            cy = img[i-1,j-1]*1 + img[i-1,j]*2 + img[i-1,j+1]*1 + img[i+1,j-1]*-1 + img[i+1,j]*-2 + img[i+1,j+1]*-1
            cx = img[i-1,j-1]*1 + img[i,j-1]*2 + img[i+1,j-1]*1 + img[i-1,j+1]*-1 + img[i,j+1]*-2 + img[i+1,j+1]*-1
            mag = numba.uint32(math.sqrt(math.pow(cx, 2) + math.pow(cy, 2)))
            if cx != 0:
                ang = math.atan2(cy, cx)  # arctangent
            else:
                if cy > 0:
                    ang = math.pi / 2
                else:
                    ang = -math.pi / 2
            if ang < 0:
                ang = abs(ang) + math.pi
            idx = (ang * bins) // (math.pi * 2)
            idx = int(idx)
            #hist[idx] += mag  # the line in question, commented out
    return hist
```
The code below was used for benchmarking:
```python
for _ in range(20):
    print('start')
    t = time.time()
    hists = []
    for i in range(8000):
        hist = hog_numba(img, 10)
    t = time.time() - t
    print('time:', t)
```
The difference in speed is not due to the assignment itself being slow, but to the optimizations performed by the JIT compiler. Indeed, if you comment out the line `hist[idx] += mag`, then Numba can see that `mag` and `idx` no longer need to be computed and can simply remove the associated lines. Transitively, it can also remove the computation of `ang`, `cx` and `cy`. Finally, it can remove the two nested loops entirely. Such code is much faster, but also useless. In practice the JIT may not fully remove all the operations inside the two nested loops, since it may not be able to fully optimize the code, possibly due to Python transformations, guards and side effects. On my machine it does optimize the loops to a no-op: it takes less than 1 ms on average to "compute" 8000 images of size (16_000, 16_000), which is totally impossible (it should be at least 1000 times slower).
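To see what this dead-code elimination amounts to, here is a conceptual sketch (not actual Numba output) of what the function is allowed to collapse to once the store is gone: nothing left in the loop body affects the returned `hist`, so the loops can be legally stripped.

```python
import numpy as np

def hog_numba_after_dce(img, bins):
    # With hist[idx] += mag commented out, no statement in the loops has a
    # visible effect, so the compiler may eliminate the loops entirely and
    # the function degenerates to returning an all-zero histogram.
    hist = np.zeros(bins, dtype=np.uint64)
    return hist
```

This is why the "benchmark" of the commented version finishes in milliseconds: it measures an empty function, not the loop.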
Thus, you cannot measure the time of an isolated instruction by simply removing it and looking at the time difference with Numba (or any optimized compiled code). Modern compilers are very advanced, and trying to defeat them is not easy. If you still want to see whether the cost actually comes mainly from the assignment, you can instead perform summations like `mag_sum += mag` and `idx_sum += idx` and return/print the summation variables (otherwise the compiler can see that they are useless, as they cause no visible change). On my machine, the assignment version is only 9% slower than an implementation using a summation, showing that the assignment does not take most of the execution time (despite not being very fast, probably due to the random access pattern).
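As a concrete illustration, here is a sketch of that summation-based variant. It is shown in pure Python for readability (in the real benchmark it would carry the same `@numba.jit` decorator as the original); the corrected loop bounds and the `int()` casts, which avoid uint8 wrap-around in plain NumPy scalar arithmetic, are my own additions.

```python
import math
import numpy as np

def hog_numba_sum(img, bins):
    # Same loop as the original, but the per-pixel results are accumulated
    # into scalars that are returned, so the compiler cannot prove the work
    # is dead and eliminate it.
    h, w = img.shape
    mag_sum = 0.0
    idx_sum = 0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            cy = (int(img[i-1,j-1]) + 2*int(img[i-1,j]) + int(img[i-1,j+1])
                  - int(img[i+1,j-1]) - 2*int(img[i+1,j]) - int(img[i+1,j+1]))
            cx = (int(img[i-1,j-1]) + 2*int(img[i,j-1]) + int(img[i+1,j-1])
                  - int(img[i-1,j+1]) - 2*int(img[i,j+1]) - int(img[i+1,j+1]))
            mag = math.sqrt(cx * cx + cy * cy)
            if cx != 0:
                ang = math.atan2(cy, cx)
            else:
                ang = math.pi / 2 if cy > 0 else -math.pi / 2
            if ang < 0:
                ang = abs(ang) + math.pi
            idx = int((ang * bins) // (math.pi * 2))
            mag_sum += mag   # visible side effect: cannot be optimized away
            idx_sum += idx
    return mag_sum, idx_sum
```

Because the sums feed into the return value, every intermediate computation stays live, and timing this variant gives a fair comparison against the version with the histogram store.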
The main source of slowdown comes from the line `idx = (ang * bins) // (math.pi * 2)`, and more specifically from the multiplication/division by a constant. Pre-computing `bins / (math.pi * 2)` in a temporary variable ahead of time results in code that is 3.5 times faster. The code is still far from optimized: further improvements include vectorization, branch-less operations and parallelism (using single precision and trying to remove the `math.atan2` call may also help).
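The hoisting described above might look like this (a sketch; `bin_index_fast` and `scale` are illustrative names, not from the original code):

```python
import math

def bin_index(ang, bins):
    # Original per-pixel formulation: a multiply plus a floor-division by
    # the constant 2*pi on every single iteration.
    return int((ang * bins) // (math.pi * 2))

def bin_index_fast(ang, scale):
    # Optimized variant: the factor bins / (2*pi) is pre-computed once
    # outside the hot loop, leaving one multiplication (and a truncation)
    # per pixel. For ang >= 0 this matches the original formulation.
    return int(ang * scale)

bins = 10
scale = bins / (math.pi * 2)  # hoisted out of the loop
```

Division is far more expensive than multiplication on most CPUs, and the compiler cannot always do this transformation itself for floating-point code because it changes rounding, which is why hoisting it manually pays off.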