performance · numpy · math · cpu · numba

Numba fast math does not improve speed


I ran the following code with the fastmath option enabled and disabled.

import numpy as np
from numba import jit
from threading import Thread
import time
import psutil
from tqdm import tqdm


@jit(nopython=True, fastmath=True)
def compute_angle(vectors):
    return 180 + np.degrees(np.arctan2(vectors[:, :, 1], vectors[:, :, 0]))



cpu_usage = list()
times = list()

# Log cpu usage
running = False
def threaded_function():
    while not running:
        time.sleep(0.1)
    print("Start logging CPU")
    while running:
        cpu_usage.append(psutil.cpu_percent())
    print("Stop logging CPU")
thread = Thread(target=threaded_function, args=())
thread.start()


iterations = 1000

# Generate frames
vectors_list = list()
for i in tqdm(range(iterations), total=iterations):
    vectors = np.random.randint(-50, 50, (500, 1000, 2))
    vectors_list.append(vectors)

for i in tqdm(range(iterations), total=iterations):
    s = time.time()
    compute_angle(vectors_list[i])
    e = time.time()
    times.append(e - s)
    # Do not count first iteration
    running = True

running = False

thread.join()

print("Average time per iteration", np.mean(times[1:]))
print("Average CPU usage:", np.mean(cpu_usage))

The results with fastmath=True are:

Average time per iteration 0.02076407738992044
Average CPU usage: 6.738916256157635

The results with fastmath=False are:

Average time per iteration 0.020854528721149738
Average CPU usage: 6.676455696202531

Should I expect some speed-up, since the function is dominated by mathematical operations? I also tried installing icc-rt, but I am not sure how to check whether it is actually being used. Thank you!


Solution

  • A few things are missing to get SIMD vectorization working. For maximum performance it is also necessary to avoid costly temporary arrays, which may not be optimized away if you use a partly vectorized function.

    • Function calls have to be inlined.
    • The memory-access pattern must be known at compile time. In the following example this is done with assert vectors.shape[2] == 2. In general, the last axis could also be larger than two, which would be much more complicated to SIMD-vectorize.
    • Division-by-zero checks can also prevent SIMD-vectorization, and they are slow if they are not optimized away. Here this is avoided manually by computing div_pi = 180/np.pi once and then using a simple multiplication inside the loop. If a repeated division is unavoidable, you can use error_model="numpy" to skip the division-by-zero check.
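
The division-avoidance point can be illustrated without Numba at all: precomputing the reciprocal factor once and multiplying is exactly the algebraic rewrite that fastmath is allowed to perform. A minimal NumPy sketch (the variable names are illustrative):

```python
import numpy as np

# Sample angles in radians, as arctan2 would produce them
angles = np.arctan2(np.array([1.0, -1.0, 0.5]), np.array([1.0, 1.0, -2.0]))

# Repeated division inside a hot loop ...
deg_div = angles * 180 / np.pi

# ... versus one precomputed factor (what div_pi = 180/np.pi achieves)
div_pi = 180 / np.pi
deg_mul = angles * div_pi

# Both forms agree to floating-point tolerance; fastmath permits this rewrite
assert np.allclose(deg_div, deg_mul)
```
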

    Example

    import numpy as np
    import numba as nb
    
    @nb.njit(fastmath=True)
    def your_function(vectors):
        return 180 + np.degrees(np.arctan2(vectors[:, :, 1], vectors[:, :, 0]))
    
    @nb.njit(fastmath=True)  # timed with both fastmath=True and fastmath=False
    def optimized_function(vectors):
        # Fix the size of the last axis at compile time to enable SIMD-vectorization
        assert vectors.shape[2] == 2
    
        res = np.empty((vectors.shape[0], vectors.shape[1]), dtype=vectors.dtype)
        div_pi = 180 / np.pi  # precompute once to avoid a division inside the loop
        for i in range(vectors.shape[0]):
            for j in range(vectors.shape[1]):
                res[i, j] = np.arctan2(vectors[i, j, 1], vectors[i, j, 0]) * div_pi + 180
        return res
    

    Timings

    vectors = np.random.rand(1000, 1000, 2)
    
    %timeit your_function(vectors)
    #no difference between fastmath=True or False, no SIMD-vectorization at all
    #23.3 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit optimized_function(vectors)
    #with fastmath=False #SIMD-vectorized, but with the slower (more accurate) SVML algorithm
    #9.03 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #with fastmath=True  #SIMD-vectorized, but with the faster (less accurate) SVML algorithm
    #4.45 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
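
Before comparing timings, it is worth confirming that the rewritten loop body computes the same values as the original vectorized expression. A NumPy-only sanity check (no Numba required), applying the scalar formula from optimized_function with broadcasting:

```python
import numpy as np

vectors = np.random.rand(100, 100, 2)

# Original vectorized formula
ref = 180 + np.degrees(np.arctan2(vectors[:, :, 1], vectors[:, :, 0]))

# The loop body of optimized_function, expressed with NumPy broadcasting
div_pi = 180 / np.pi
opt = np.arctan2(vectors[:, :, 1], vectors[:, :, 0]) * div_pi + 180

# The two should agree to floating-point tolerance
assert np.allclose(ref, opt)
print("max abs difference:", np.abs(ref - opt).max())
```
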