python numpy performance numpy-ndarray numpy-slicing

How to efficiently slice numpy arrays? (Finite difference method)

I am trying to use the finite difference method with NumPy arrays, but the slicing is incredibly slow. It's not viable to use lists as I am applying math operations. Using matlab, equivalent code is executed in 2 seconds whilst the python code finishes executing for over 1 minute. My python code is:

import numpy as np
from scipy import interpolate 

def heston_explicit_nonuniform(S,V,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds):
    
    U = np.empty((Ns,Nv))
    
    for s in range(Ns):
        for v in range(Nv):
            U[s,v] = np.maximum( S[s] - K , 0)
        
 
    for t in range(Nt - 1):
        
        
        for v in range(Nv-1):
            U[0,v] = 0
            U[Ns-1,v] = np.maximum( Smax - K ,0)
            
        for s in range(Ns):
           
            U[s,Nv-1] = np.maximum(S[s] - K, 0)

        u = np.copy(U)
  

        for s in range(1,Ns-1):
            derV = (u[s,1] - u[s,0]) / (V[1] - V[0]) 
            derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
            U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV) 
   
        
        u = np.copy(U)
        
        for s in range(1,Ns-1):
            
            for v in range(1,Nv-1):
                derS = 0.5*(u[s+1,v] - u[s-1,v]) / (S[s+1] - S[s-1])
                derV = 0.5*(u[s,v+1] - u[s,v-1]) / (V[v+1] - V[v-1])
                derSS = (u[s+1,v] - 2*u[s,v] + u[s-1,v]) / ( (S[s+1] - S[s])*(S[s]-S[s-1]))
                derVV = (u[s,v+1] - 2*u[s,v] + u[s,v-1]) / ( (V[v+1] - V[v])*(V[v]-V[v-1]))
                derSV = (u[s+1,v+1] + u[s-1,v-1] - u[s-1,v+1] - u[s+1,v-1]) \
                    / (4 * (S[s+1] - S[s-1]) * (V[v+1] - V[v-1]))
                A = 0.5 * V[v] * (S[s]**2) * derSS 
                B = rho*sigma*V[v]*S[s] *derSV
                C = 0.5*(sigma**2)*V[v]*derVV
                D = 0.5*(r-q)*S[s]*derS
                E = kappa*(theta-V[v])*derV
                L = dt * (A + B + C + D + E - r*u[s,v])
                U[s,v] = L + u[s,v]
      
        if t %100==0:
            print(t)
            
  
    return U

Vmax=0.5
Vmin=0
Smin=0
T=0.15

K=100.
Smax = 2*K

Vmax=0.5

r= 0.02
q=0.05
rho=-0.9
v0=0.05
sigma=0.3
kappa=1.5
theta=0.04
L=17
Nv=39

Ns=79

Nt = 3000

dt = T/Nt

Vi = np.linspace(0,Vmax,Nv)
Si = np.linspace(0,Smax,Ns)
dv = (Vmax-Vmin)/Nv    
ds = (Smax-Smin)/Ns  
  
call = heston_explicit_nonuniform(Si,Vi,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds)

data = interpolate.RectBivariateSpline(Si,Vi,call)


z_new = data(101.52,0.05412)
print('FDM: ', z_new[0,0])

Solution

The thing is, there is no slicing in your code. Which is what is missing.

In numpy, you compute things on whole arrays, not elements by elements. In other words, you need to avoid for loops, that are pure python (and therefore very slow: the only reason python is quite fast when using numpy, is because most of the CPU time is spend in C code, not in python code).

So for example, to focus on a specific and quite easy part of your code, this loop

for s in range(1,Ns-1):
    derV = (u[s,1] - u[s,0]) / (V[1] - V[0]) 
    derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
    U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV)

should not exist (or, to be more accurate, it should exist only in numpy's internal code, in C) what you do here, is, for each s between 1 and Ns-1, you compute derV, that is a combination of sth element of 2 vectors (u[:,1] and u[:,0], so 2 columns of u) and a scalar (V[1]-V[0]).

So, all together that is Ns-2 computation for 2 Ns-2 vectors. You can let numpy do that

derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])

And now, derV is an array of the Ns-2 value you compute

So this part of the code can be turned into

derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
for s in range(1,Ns-1):
    derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
    U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV[s-1])

So we have "vectorized" the computation of the Ns-2 derV value

Likewise

derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])

derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])
for s in range(1,Ns-1):
    U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS[s-1] + kappa*theta*derV[s-1])

And that last iterated computation, U[s,0] is also just Ns-2 independant computations, from the Ns-2 u[s,0] values, the Ns-2 derS values, the Ns-2 derV values, the Ns-2 S values, and some scalars.

So it can also be turned into

derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])
U[1:-1,0] += dt * ( -r*u[1:-1,0] + (r-q) * S[1:-1] * derS + kappa*theta*derV)

If I continue this for your whole code

import numpy as np
from scipy import interpolate 

def heston_explicit_nonuniform(S,V,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds):
    
    U = np.empty((Ns,Nv))
    U[:,:] = np.maximum(S-K, 0)[:,None]
        
 
    for t in range(Nt - 1):
        U[0,:]=0
        U[Ns-1,:] = np.maximum(Smax-K,0)
        U[:,Nv-1] = np.maximum(S-K,0)
            

        u=U.copy()

        derV=(u[1:-1,1]-u[1:-1,0])/(V[1]-V[0])
        derS=(u[2:,0]-u[:-2,0])/(S[2:]-S[:-2])
        U[1:-1,0] = u[1:-1,0] + dt*(-r*u[1:-1,0] + (r-q)*S[1:-1] * derS + kappa*theta*derV)
        
        u = U.copy()
        
        derS = 0.5*(u[2:,1:-1] - u[:-2,1:-1]) / (S[2:]-S[:-2])[:,None]
        derV = 0.5*(u[1:-1,2:]-u[1:-1,:-2]) / (V[2:]-V[:-2])[None,:]
        derSS = (u[2:,1:-1]-2*u[1:-1,1:-1]+u[:-2,1:-1]) / ((S[2:]-S[1:-1]) * (S[1:-1]-S[:-2]))[:,None]
        derVV = (u[1:-1,2:]-2*u[1:-1,1:-1]+u[1:-1,:-2]) / ((V[2:]-V[1:-1]) * (V[1:-1]-V[:-2]))[None,:]
        derSV = (u[2:,2:] + u[:-2,:-2] - u[:-2, 2:] - u[2:,:-2]) \
               / 4 / (S[2:]-S[:-2])[:,None] / (V[2:]-V[:-2])[None,:]
        A = 0.5*V[None,1:-1] * (S[1:-1,None]**2) * derSS
        B = rho*sigma*V[None,1:-1]*S[1:-1,None] * derSV
        C = 0.5*(sigma**2)*V[None,1:-1]*derVV
        D = 0.5*(r-q)*S[1:-1, None]*derS
        E = kappa*(theta-V[None,1:-1])*derV
        L = dt * (A+B+C+D+E- r*u[1:-1,1:-1])
        U[1:-1,1:-1] += L
      
    return U

Vmax=0.5
Vmin=0
Smin=0
T=0.15

K=100.
Smax = 2*K

Vmax=0.5

r= 0.02
q=0.05
rho=-0.9
v0=0.05
sigma=0.3
kappa=1.5
theta=0.04
L=17
Nv=39

Ns=79

Nt = 3000

dt = T/Nt

Vi = np.linspace(0,Vmax,Nv)
Si = np.linspace(0,Smax,Ns)
dv = (Vmax-Vmin)/Nv    
ds = (Smax-Smin)/Ns  
  
call = heston_explicit_nonuniform(Si,Vi,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds)

data = interpolate.RectBivariateSpline(Si,Vi,call)


z_new = data(101.52,0.05412)
print('FDM: ', z_new[0,0])

It returns the exact same result (including the very last decimal place), because it is the exact same computation. Just, iterations are done inside numpy's internal code, not in python

Note that I haven't even tried to understand the algorithm. I just translated as a I would any arrays handling. Because, all the computation I've rewritten were batches of Ns-2 or Nv-2 or (Ns-2)(Nv-2) independent computation (the order wasn't important)

It would probably be interesting to go deeper in the algorithm itself, not just its coding. For example, I am pretty sure, one at least of the two copy() could be avoided.

It would be harder (but probably not impossible) to also vectorize the for t loop. Because this loop is different: computation done at the tth iteration is dependent on the result of the **(t-1)**th iteration's result. Which was not the case for the for s and for v loops

That may be possible using cumulative operator (that allows to use, in some specific cases, numpy "vectorization", that is hijacking internal numpy's for loop to replace yours, even when each iteration depends on the previous one. For example, using cumsum, cumprod. (Memory wise, that would imply to keep a big array of all values of all (t,s,v) combinations. But from your example, that would be hardly 10 millions values, which is really not a problem nowadays)

But still, without even making any effort to adapt to the problem, by simple translation of for loops into numpy batches computation, we already gained a ×100 factor (on my slow machine, it went from precisely 100 seconds to 1 second. It is not every day that I see cases where timing ratio computation are that obvious :-) )