My code produces the correct result, but it feels painstakingly slow.
How could I make the Z00
function more efficient?
My goal here is to demonstrate numerical stability with respect to the cube size (i.e.n_max
).
What would be the most efficient way of generating many function values and displaying them in a graph?
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import itertools as it
import time as t
def Z00(eta,n_max,mu,m,d3):
precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
cartesian = list(it.product(*precartesian))
gamma = np.sqrt(1+d3**2/(4*m**2+eta**2))
sum=0
for (n1,n2,n3) in cartesian:
r2 = gamma**2*(n3-d3/2)**2+n1**2+n2**2 #r²
sum+=1/((r2-eta**2)*(r2+mu**2)**2)
return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))
The Z00
function takes n_max
, to create a 3d grid. Then, for each value on the grid a function value r2
is computed. This is further manipulated and summed up into sum
.
I then calculate values and plot the function:
#set function parameters
nmax=10
mu=2
m=1
#set plot parameters
xmin=0
xmax=6
#compute function values
x = np.linspace(xmin,xmax,10000)
y = Z00(x,nmax,mu,m,1)
plt.plot(x,y)
In general, calculation is going to scale as n_max**3
, since we are computing a function of values inside a cube. I have already tried using built in functions such as np.sum()
along with list comprehension for calculating sum:
def Z00(eta,n_max,mu,m,d3):
precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
cartesian = list(it.product(*precartesian))
gamma = np.sqrt(1+d3**2/(4*m**2+eta**2))
r2s = [gamma**2*(n3-d3/2)**2+n1**2+n2**2 for (n1,n2,n3) in cartesian]
sum = np.sum([1/((r2-eta**2)*(r2+mu**2)**2) for r2 in r2s])
return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))
Also, I tried reducing the number of local variables assigned, as well as moving the 3d grid to a global variable, but that didn't help either.
NOTE There is an Addendum below that integrates my initial considerations.
I first streamlined your Z00
to Z01
, avoiding recomputations, then I converted Z00
to Z02
, a vectorial implementation (see details below).
Eventually I tested my implementations for correctness and benchmarked them against your original code,
13:46 Fri, Aug 02 ~/ $ ipython-3.12 -i faster.py
Python 3.12.3 (main, Apr 10 2024, 14:51:56) [GCC]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.25.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: # are the results the same?
...: for Z0x in (Z00, Z01, Z02): print(Z0x(x, nmax, mu, m, 1))
[ -2.87639267 -2.87639041 -2.87638363 ... 168.60197891 243.68957108
493.83389926]
[ -2.87639267 -2.87639041 -2.87638363 ... 168.60197891 243.68957108
493.83389926]
[ -2.87639267 -2.87639041 -2.87638363 ... 168.60197891 243.68957108
493.83389926]
In [2]: # which is fastest?
...: for Z0x in (Z00, Z01, Z02):
...: %timeit Z0x(x, nmax, mu, m, 1)
320 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
275 ms ± 442 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.16 s ± 706 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]:
As you can see all the three implementations agree on the result, the streamlined one is marginally faster than yours, and the fully vectorized one is horribly slower by a factor of 3÷4 (I wonder if I've made some kind of blooper in this implementation).
While Z01
is faster, I'd say that the compiler, or Numpy, already made some optimizations re repeated computations in Z00
, and that (for now at least) vectorizing the computation doesn't help.
And here, as promised, my code with the three different functions and the data.
import numpy as np
from itertools import product
nmax = 10
mu = 2
m = 1
#set plot parameters
xmin = 0
xmax = 6
#compute function values
x = np.linspace(xmin, xmax, 10000)
def Z00(eta,n_max,mu,m,d3):
precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
cartesian = list(product(*precartesian))
gamma = np.sqrt(1+d3**2/(4*m**2+eta**2))
sum=0
for (n1,n2,n3) in cartesian:
r2 = gamma**2*(n3-d3/2)**2+n1**2+n2**2 #r²
sum+=1/((r2-eta**2)*(r2+mu**2)**2)
return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))
def Z01(eta, n_max, mu, m, d3):
cartesian = list(product(*[range(-n_max,n_max+1)]*3))
eta2 = eta**2
gamma2 = 1 + d3**2 / (4*m**2+eta2)
sum=0
for (n1,n2,n3) in cartesian:
r2 = gamma2*(n3-d3/2)**2+n1**2+n2**2 #r²
sum+=1/((r2-eta2)*(r2+mu**2)**2)
return 1/(2*np.sqrt(np.pi))*((mu**2-eta2)**2*sum+gamma2*np.pi**2/mu*(eta2-mu**2))
def Z02(eta, n_max, mu, m, d3):
mnmx = np.linspace(-nmax, +nmax, nmax+nmax+1)
X, Y, DZ = np.meshgrid(mnmx, mnmx, mnmx-d3/2)
eta2 = eta**2
gamma2 = 1 + d3**2 / (4*m**2+eta2)
sum=0
r2 = (X**2 + Y**2 + np.einsum('ijk,l->lijk', DZ**2, gamma2)).transpose((1,2,3,0))
sum = np.sum(1/((r2-eta2)*(r2+mu**2)**2), axis=(0,1,2))
return 1/(2*np.sqrt(np.pi))*((mu**2-eta2)**2*sum+gamma2*np.pi**2/mu*(eta**2-mu**2))
Addendum
I have tried with different x
discretizations
X = np.linspace(0, 6, 10001)
x = np.linspace(0, 6, 501)
and these are my findings
In [4]: # correctness of different implementations, dependency on len(x)
...: for Z0x in (Z00, Z01, Z02):
...: y = Z0x(X, nmax, mu, m, 1)
...: print('%s(X, …): '%Z0x.__name__, y[:41:20], y[-41::20])
...: y = Z0x(x, nmax, mu, m, 1)
...: print('%s(x, …): '%Z0x.__name__, y[:3], y[-3:])
Z00(X, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
Z00(x, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
Z01(X, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
Z01(x, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
Z02(X, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
Z02(x, …): [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229 41.55340637 493.83389926]
In [5]: # which is fastest?
...: for Z0x in (Z00, Z01, Z02):
...: print('%s(X, …): '%Z0x.__name__, end='')
...: %timeit Z0x(X, nmax, mu, m, 1)
...: print('%s(x, …): '%Z0x.__name__, end='')
...: %timeit Z0x(x, nmax, mu, m, 1)
Z00(X, …): 340 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z00(x, …): 137 ms ± 178 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Z01(X, …): 277 ms ± 548 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z01(x, …): 121 ms ± 298 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Z02(X, …): 1.34 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z02(x, …): 63.4 ms ± 41.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]:
r2
and also intermediate results), using less memory the vectorized code is the faster.In light of the points above, and the intended scope (plotting) I recommend using Z02
and a discretization in the order of 100÷200 points that is "good enough" for plot(x, Z02(x, …))
.