python numpy performance numba

# Why is a jitted Numba function slower than the original function?

I've written a function to create uniformly spaced points on a disk. Since it runs quite often and on relatively large arrays, I figured applying `numba` would increase the speed significantly. However, a quick test shows that the `numba` version is more than twice as slow.

Is there a way to figure out what is slowing down the `numba` function?

Here's the function:

``````
import numpy as np
from numba import njit

@njit(cache=True)
def generate_points_turbo(centre_point, radius, num_rings, x_axis=np.array([-1, 0, 0]), y_axis=np.array([0, 1, 0])):
    """
    Generate uniformly spaced points inside a circle
    Based on algorithm from:
    http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/

    Parameters
    ----------
    centre_point : np.ndarray (1, 3)
    num_rings : int
    x_axis : np.ndarray
    y_axis : np.ndarray

    Returns
    -------
    points : np.ndarray (n, 3)

    """
    if num_rings > 0:
        delta_R = 1 / num_rings
        # Assumed: this line is missing from the post; ring radii scaled by `radius`.
        ring_radii = np.linspace(delta_R, 1, num_rings) * radius
        k = np.arange(num_rings) + 1
        points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
        num_points = points_per_ring.sum() + 1
        # Integer dtype so the entries can be used as slice bounds below.
        ring_indices = np.zeros(int(num_rings)+1, dtype=np.int64)
        ring_indices[1:] = points_per_ring.cumsum()
        ring_indices += 1
        points = np.zeros((num_points, 3))

        points[0, :] = centre_point

        # Assumed: the loop header is missing from the post; one pass per ring.
        for indx in range(len(ring_radii)):
            theta = np.linspace(0, 2 * np.pi, points_per_ring[indx]+1)
            points[ring_indices[indx]:ring_indices[indx+1], :] = ((ring_radii[indx] * np.cos(theta[1:]) * x_axis[:, None]).T
                                                                  + (ring_radii[indx] * np.sin(theta[1:]) * y_axis[:, None]).T)
        return points + centre_point

``````

And it's called like this:

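``````
centre_point = np.array([0,0,0])
num_rings = 15
# Assumed call (missing from the post); radius 1 matches the timings below.
points = generate_points_turbo(centre_point, 1, num_rings)

``````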

Would be great if someone knows why the function is slower when `numba` compiled or how to go about finding out what the bottleneck for the `numba` function is.
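One thing worth ruling out when timing a jitted function is the first call, since Numba compiles (or loads its cache) the first time a function is called with a given set of argument types. The following is only a minimal sketch of a timing helper that does a warm-up call before measuring; `time_after_warmup` and `sum_of_squares` are hypothetical names used for illustration, and the plain Python workload is a stand-in for the function being measured.

```python
import timeit


def time_after_warmup(func, *args, repeats=5, number=100):
    """Return the best per-call time in seconds, excluding the first call."""
    # Warm-up call: a jitted function would compile or load its cache here,
    # so that cost does not leak into the measurement below.
    func(*args)
    totals = timeit.repeat(lambda: func(*args), repeat=repeats, number=number)
    return min(totals) / number


def sum_of_squares(n):
    # Hypothetical stand-in workload; swap in the function under test.
    return sum(i * i for i in range(n))


best = time_after_warmup(sum_of_squares, 1000)
print(f"best time per call: {best * 1e6:.1f} µs")
```

If the jitted version is still slower after the warm-up, the difference is probably per-call overhead or the work inside the function rather than compilation.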

### Update: Possible computer specific size dependence

It seems the `numba` function is working, but the cross-over point between where it's slower and where it's faster may be hardware specific.

``````
%timeit generate_points(centre_point, 1, 2)
99.5 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 2)
213 µs ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit generate_points(centre_point, 1, 20)
647 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 20)
314 µs ± 8.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit generate_points(centre_point, 1, 200)
11.9 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit generate_points_turbo(centre_point, 1, 200)
7.9 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

``````

After about 12-15 rings the `numba` function (`*_turbo`) becomes as fast as or faster than the original on my machine, but the performance gains at larger sizes are smaller than expected. So it does seem to be working; it's just that some part of the function is heavily size dependent.
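To put the size dependence in perspective, it may help to look at how many points each call actually produces. This is a rough sketch that reuses only the points-per-ring formula from the function above; `count_points` is a hypothetical helper name, not part of the original code.

```python
import numpy as np


def count_points(num_rings):
    """Total number of points for `num_rings` rings, plus the centre point."""
    k = np.arange(num_rings) + 1
    points_per_ring = np.rint(np.pi / np.arcsin(1 / (2 * k))).astype(np.int32)
    return int(points_per_ring.sum()) + 1


# Same ring counts as in the timings above.
for rings in (2, 20, 200):
    print(rings, count_points(rings))
```

Ring `k` holds roughly `2 * pi * k` points, so the total grows roughly quadratically with `num_rings`. With 2 rings there are only 19 points, so fixed per-call overhead may well dominate the measurement at that size.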

## Solution

I got rid of all the transpositions / newaxis / 3D stuff that you were not using and got a 20x boost compared to your original solution. I also replaced a `range` with a `prange` for good measure, since you don't care in which order your points are calculated.

``````
# Imports.
import matplotlib.pyplot as plt
from numba import njit, prange
import numpy as np

# "Turbo" function.
@njit(cache=True)
# Assumed signature: the `def` line is missing from the post.
def generate_points_turbo(centre_point, radius, num_rings):
    """
    Generate uniformly spaced points inside a circle
    Based on algorithm from:
    http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/

    Parameters
    ----------
    centre_point : np.ndarray (2,)
    radius : float
    num_rings : int

    Returns
    -------
    points : np.ndarray (n, 2)

    """
    if not num_rings > 0:
        return

    delta_R = 1 / num_rings
    ring_radii = np.linspace(delta_R, 1, num_rings)  # Use a unit circle that we will scale only at the end.
    k = np.arange(num_rings) + 1
    points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
    num_points = points_per_ring.sum() + 1

    points = np.zeros((num_points, 2))
    n = 1  # n == 0 is the central point by design.

    # Assumed: the loop header and the definition of `r` are missing from the post.
    for ring_number in prange(len(ring_radii)):
        r = ring_radii[ring_number]
        points_on_this_ring = points_per_ring[ring_number]
        theta = np.linspace(0, 2 * np.pi, points_on_this_ring)
        points[n: n+points_on_this_ring, 0] = r * np.cos(theta)
        points[n: n+points_on_this_ring, 1] = r * np.sin(theta)
        n += points_on_this_ring

    return points * radius + centre_point

# Test that the result is accurate.
if __name__ == "__main__":

    centre_point = np.array([0, 0])
    radius = 1.0  # Assumed value; the assignment is missing from the post.
    num_rings = 10

    # Assumed call; the line defining `p` is missing from the post.
    p = generate_points_turbo(centre_point, radius, num_rings)

    fig, ax = plt.subplots()
    ax.set_aspect(1)
    ax.scatter(*p.T)
    fig.show()
``````

``````
# Test time taken.
>>> from timeit import timeit
>>> from initial_code import generate_points_turbo as generate_points_turbo_stackoverflow

>>> %timeit generate_points_turbo(centre_point, radius, num_rings)  # assumed call; this line is missing from the post
13.5 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit generate_points_turbo_stackoverflow(np.array([0, 0, 0]), radius, num_rings)
261 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

``````
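As a sanity check on dropping the transpositions and the third dimension, here is a small NumPy-only sketch comparing the two ways of filling one ring. It assumes in-plane axes `[1, 0, 0]` and `[0, 1, 0]` for illustration; the question's default `x_axis` is `[-1, 0, 0]`, which would simply mirror the x coordinate. The values of `r` and `theta` are arbitrary.

```python
import numpy as np

r = 0.5
theta = np.linspace(0, 2 * np.pi, 7)[1:]   # six angles, as for a first ring
x_axis = np.array([1.0, 0.0, 0.0])         # assumed axes, for illustration only
y_axis = np.array([0.0, 1.0, 0.0])

# Broadcast-and-transpose form, as in the question: shape (6, 3).
ring_3d = ((r * np.cos(theta) * x_axis[:, None]).T
           + (r * np.sin(theta) * y_axis[:, None]).T)

# Direct column writes, as in the answer: shape (6, 2).
ring_2d = np.zeros((theta.size, 2))
ring_2d[:, 0] = r * np.cos(theta)
ring_2d[:, 1] = r * np.sin(theta)

print(np.allclose(ring_3d[:, :2], ring_2d))  # x and y columns agree
print(np.allclose(ring_3d[:, 2], 0.0))       # z column stays at zero
```

Under these assumptions the third column carries no information, so writing the two columns directly avoids building and transposing a temporary `(3, n)` array for every ring.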