I've written a function to create uniformly spaced points on a disk and since it's run quite often and on relatively large array I figured the application of numba
would increase the speed significantly. However, upon running a quick test I've found that the numba
function is more than twice as slow.
Is there a way to figure out what is slowing down the numba
function?
Here's the function:
@njit(cache=True)
def generate_points_turbo(centre_point, radius, num_rings, x_axis=np.array([-1, 0, 0]), y_axis=np.array([0, 1, 0])):
"""
Generate uniformly spaced points inside a circle
Based on algorithm from:
http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/
Parameters
----------
centre_point : np.ndarray (1, 3)
radius : float/int
num_rings : int
x_axis : np.ndarray
y_axis : np.ndarray
Returns
-------
points : np.ndarray (n, 3)
"""
if num_rings > 0:
delta_R = 1 / num_rings
ring_radii = np.linspace(delta_R, 1, int(num_rings)) * radius
k = np.arange(num_rings) + 1
points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
num_points = points_per_ring.sum() + 1
ring_indices = np.zeros(int(num_rings)+1)
ring_indices[1:] = points_per_ring.cumsum()
ring_indices += 1
points = np.zeros((num_points, 3))
points[0, :] = centre_point
for indx in range(len(ring_radii)):
theta = np.linspace(0, 2 * np.pi, points_per_ring[indx]+1)
points[ring_indices[indx]:ring_indices[indx+1], :] = ((ring_radii[indx] * np.cos(theta[1:]) * x_axis[:, None]).T
+ (ring_radii[indx] * np.sin(theta[1:]) * y_axis[:, None]).T)
return points + centre_point
And it's called like this:
centre_point = np.array([0,0,0])
radius = 1
num_rings = 15
generate_points_turbo(centre_point, radius, num_rings )
Would be great if someone knows why the function is slower when numba
compiled or how to go about finding out what the bottleneck for the numba
function is.
It seems the numba
function is working, but the cross-over between where it's faster and slower maybe be hardware specific.
%timeit generate_points(centre_point, 1, 2)
99.5 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 2)
213 µs ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit generate_points(centre_point, 1, 20)
647 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 20)
314 µs ± 8.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit generate_points(centre_point, 1, 200)
11.9 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit generate_points_turbo(centre_point, 1, 200)
7.9 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
After about 12-15 rings the numba
function (*_turbo) starts to become a similar speed or faster on my machine, but the performance gains at larger size are smaller than expected. But seems like it is actually working, just that some part of the function is heavily size dependent.
I got rid of all your transpositions / newaxis / 3D stuff that you were not using and got a x20 boost compared to your original solution. I replaced a range
with a prange
too for good measure, as you don't care in which order your points are calculated.
# Imports.
import matplotlib.pyplot as plt
from numba import njit, prange
import numpy as np
# "Turbo" function.
@njit(cache=True)
def generate_points_turbo(centre_point, radius, num_rings):
"""
Generate uniformly spaced points inside a circle
Based on algorithm from:
http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/
Parameters
----------
centre_point : np.ndarray (2,)
radius : float/int
num_rings : int
x_axis : np.ndarray
y_axis : np.ndarray
Returns
-------
points : np.ndarray (n, 2)
"""
if not num_rings > 0:
return
delta_R = 1 / num_rings
ring_radii = np.linspace(delta_R, 1, num_rings) # Use a unit circle that we will scale only at the end.
k = np.arange(num_rings) + 1
points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
num_points = points_per_ring.sum() + 1
points = np.zeros((num_points, 2))
n = 1 # n == 0 is the central point by design.
for ring_number in prange(len(ring_radii)):
r = ring_radii[ring_number] # The local radius between 1/num_rings and 1.
points_on_this_ring = points_per_ring[ring_number]
theta = np.linspace(0, 2 * np.pi, points_on_this_ring)
points[n: n+points_on_this_ring, 0] = r * np.cos(theta)
points[n: n+points_on_this_ring, 1] = r * np.sin(theta)
n += points_on_this_ring
return points * radius + centre_point
# Test that the result is accurate.
if __name__ == "__main__":
centre_point = np.array([0, 0])
radius = 3.14159
num_rings = 10
p = generate_points_turbo(centre_point, radius, num_rings)
fig, ax = plt.subplots()
ax.set_aspect(1)
ax.scatter(*p.T)
fig.show()
# Test time taken.
>>> from timeit import timeit
>>> from initial_code import generate_points_turbo as generate_points_turbo_stackoverflow
>>> %timeit generate_points_turbo(centre_point, radius, num_rings)
>>> 13.5 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit generate_points_turbo_stackoverflow(np.array([0, 0, 0]), radius, num_rings)
>>> 261 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)