Optimal Iteration Technique for Calculating a Series (Large Sum)

My code produces the correct result, but it feels painstakingly slow. How could I make the Z00 function more efficient?

My goal here is to demonstrate numerical stability with respect to the cube size (i.e.n_max).

What would be the most efficient way of generating many function values and displaying them in a graph?

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import itertools as it
import time as t

def Z00(eta,n_max,mu,m,d3):
    precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
    cartesian = list(it.product(*precartesian))
    gamma = np.sqrt(1+d3**2/(4*m**2+eta**2)) 
    sum=0
    for (n1,n2,n3) in cartesian:
        r2 = gamma**2*(n3-d3/2)**2+n1**2+n2**2 #r²
        sum+=1/((r2-eta**2)*(r2+mu**2)**2)
    return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))

The Z00 function takes n_max, to create a 3d grid. Then, for each value on the grid a function value r2 is computed. This is further manipulated and summed up into sum.

I then calculate values and plot the function:

#set function parameters
nmax=10
mu=2
m=1

#set plot parameters
xmin=0
xmax=6

#compute function values
x = np.linspace(xmin,xmax,10000)
y = Z00(x,nmax,mu,m,1)
plt.plot(x,y)

In general, calculation is going to scale as n_max**3, since we are computing a function of values inside a cube. I have already tried using built in functions such as np.sum() along with list comprehension for calculating sum:

def Z00(eta,n_max,mu,m,d3):
    precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
    cartesian = list(it.product(*precartesian))
    gamma = np.sqrt(1+d3**2/(4*m**2+eta**2))
    r2s = [gamma**2*(n3-d3/2)**2+n1**2+n2**2 for (n1,n2,n3) in cartesian]
    sum = np.sum([1/((r2-eta**2)*(r2+mu**2)**2) for r2 in r2s])
    return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))

Also, I tried reducing the number of local variables assigned, as well as moving the 3d grid to a global variable, but that didn't help either.

Solution

NOTE There is an Addendum below that integrates my initial considerations.

I first streamlined your Z00 to Z01, avoiding recomputations, then I converted Z00 to Z02, a vectorial implementation (see details below).

Eventually I tested my implementations for correctness and benchmarked them against your original code,

13:46 Fri, Aug 02 ~/ $ ipython-3.12 -i faster.py
Python 3.12.3 (main, Apr 10 2024, 14:51:56) [GCC]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: # are the results the same?
   ...: for Z0x in (Z00, Z01, Z02): print(Z0x(x, nmax, mu, m, 1))
[ -2.87639267  -2.87639041  -2.87638363 ... 168.60197891 243.68957108
 493.83389926]
[ -2.87639267  -2.87639041  -2.87638363 ... 168.60197891 243.68957108
 493.83389926]
[ -2.87639267  -2.87639041  -2.87638363 ... 168.60197891 243.68957108
 493.83389926]

In [2]: # which is fastest?
   ...: for Z0x in (Z00, Z01, Z02):
   ...:     %timeit Z0x(x, nmax, mu, m, 1)
320 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
275 ms ± 442 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.16 s ± 706 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]:

As you can see all the three implementations agree on the result, the streamlined one is marginally faster than yours, and the fully vectorized one is horribly slower by a factor of 3÷4 (I wonder if I've made some kind of blooper in this implementation).

While Z01 is faster, I'd say that the compiler, or Numpy, already made some optimizations re repeated computations in Z00, and that (for now at least) vectorizing the computation doesn't help.

And here, as promised, my code with the three different functions and the data.

import numpy as np
from itertools import product

nmax = 10
mu = 2
m = 1

#set plot parameters
xmin = 0
xmax = 6

#compute function values
x = np.linspace(xmin, xmax, 10000)

def Z00(eta,n_max,mu,m,d3):
    precartesian = [range(-n_max,n_max+1),range(-n_max,n_max+1),range(-n_max,n_max+1)] #3d grid
    cartesian = list(product(*precartesian))
    gamma = np.sqrt(1+d3**2/(4*m**2+eta**2)) 
    sum=0
    for (n1,n2,n3) in cartesian:
        r2 = gamma**2*(n3-d3/2)**2+n1**2+n2**2 #r²
        sum+=1/((r2-eta**2)*(r2+mu**2)**2)
    return 1/(2*np.sqrt(np.pi))*((mu**2-eta**2)**2*sum+gamma**2*np.pi**2/mu*(eta**2-mu**2))

def Z01(eta, n_max, mu, m, d3):
    cartesian = list(product(*[range(-n_max,n_max+1)]*3))
    eta2 = eta**2
    gamma2 = 1 + d3**2 / (4*m**2+eta2)
    sum=0
    for (n1,n2,n3) in cartesian:
        r2 = gamma2*(n3-d3/2)**2+n1**2+n2**2 #r²
        sum+=1/((r2-eta2)*(r2+mu**2)**2)
    return 1/(2*np.sqrt(np.pi))*((mu**2-eta2)**2*sum+gamma2*np.pi**2/mu*(eta2-mu**2))

def Z02(eta, n_max, mu, m, d3):
    mnmx = np.linspace(-nmax, +nmax, nmax+nmax+1)
    X, Y, DZ = np.meshgrid(mnmx, mnmx, mnmx-d3/2)
    eta2 = eta**2
    gamma2 = 1 + d3**2 / (4*m**2+eta2)
    sum=0
    r2 = (X**2 + Y**2 + np.einsum('ijk,l->lijk', DZ**2, gamma2)).transpose((1,2,3,0))
    sum = np.sum(1/((r2-eta2)*(r2+mu**2)**2), axis=(0,1,2))
    return 1/(2*np.sqrt(np.pi))*((mu**2-eta2)**2*sum+gamma2*np.pi**2/mu*(eta**2-mu**2))

Addendum

I have tried with different x discretizations

X = np.linspace(0, 6, 10001)
x = np.linspace(0, 6,   501)

and these are my findings

In [4]: # correctness of different implementations, dependency on len(x)
   ...: for Z0x in (Z00, Z01, Z02):
   ...:     y = Z0x(X, nmax, mu, m, 1)
   ...:     print('%s(X, …):  '%Z0x.__name__, y[:41:20], y[-41::20])
   ...:     y = Z0x(x, nmax, mu, m, 1)
   ...:     print('%s(x, …):  '%Z0x.__name__, y[:3], y[-3:])
Z00(X, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]
Z00(x, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]
Z01(X, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]
Z01(x, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]
Z02(X, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]
Z02(x, …):   [-2.87639267 -2.87548814 -2.87277102] [ 24.47783229  41.55340637 493.83389926]

In [5]: # which is fastest?
   ...: for Z0x in (Z00, Z01, Z02):
   ...:     print('%s(X, …):  '%Z0x.__name__, end='')
   ...:     %timeit Z0x(X, nmax, mu, m, 1)
   ...:     print('%s(x, …):  '%Z0x.__name__, end='')
   ...:     %timeit Z0x(x, nmax, mu, m, 1)
Z00(X, …):  340 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z00(x, …):  137 ms ± 178 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Z01(X, …):  277 ms ± 548 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z01(x, …):  121 ms ± 298 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Z02(X, …):  1.34 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Z02(x, …):  63.4 ms ± 41.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]:

I was surprised that results apparently don't depend on the discretization, but on second thought it's OK.
The problem with the vectorized code was the amount of memory involved, about 800 MB for each of different arrays (r2 and also intermediate results), using less memory the vectorized code is the faster.

In light of the points above, and the intended scope (plotting) I recommend using Z02 and a discretization in the order of 100÷200 points that is "good enough" for plot(x, Z02(x, …)).