Search code examples
pythonnumpyperformancenumpy-ndarraynumpy-slicing

How to efficiently slice numpy arrays? (Finite difference method)


I am trying to use the finite difference method with NumPy arrays, but the slicing is incredibly slow. It's not viable to use lists as I am applying math operations. Using matlab, equivalent code is executed in 2 seconds whilst the python code finishes executing for over 1 minute. My python code is:

import numpy as np
from scipy import interpolate 

def heston_explicit_nonuniform(S,V,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds):
    
    U = np.empty((Ns,Nv))
    
    for s in range(Ns):
        for v in range(Nv):
            U[s,v] = np.maximum( S[s] - K , 0)
        
 
    for t in range(Nt - 1):
        
        
        for v in range(Nv-1):
            U[0,v] = 0
            U[Ns-1,v] = np.maximum( Smax - K ,0)
            
        for s in range(Ns):
           
            U[s,Nv-1] = np.maximum(S[s] - K, 0)

        u = np.copy(U)
  

        for s in range(1,Ns-1):
            derV = (u[s,1] - u[s,0]) / (V[1] - V[0]) 
            derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
            U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV) 
   
        
        u = np.copy(U)
        
        for s in range(1,Ns-1):
            
            for v in range(1,Nv-1):
                derS = 0.5*(u[s+1,v] - u[s-1,v]) / (S[s+1] - S[s-1])
                derV = 0.5*(u[s,v+1] - u[s,v-1]) / (V[v+1] - V[v-1])
                derSS = (u[s+1,v] - 2*u[s,v] + u[s-1,v]) / ( (S[s+1] - S[s])*(S[s]-S[s-1]))
                derVV = (u[s,v+1] - 2*u[s,v] + u[s,v-1]) / ( (V[v+1] - V[v])*(V[v]-V[v-1]))
                derSV = (u[s+1,v+1] + u[s-1,v-1] - u[s-1,v+1] - u[s+1,v-1]) \
                    / (4 * (S[s+1] - S[s-1]) * (V[v+1] - V[v-1]))
                A = 0.5 * V[v] * (S[s]**2) * derSS 
                B = rho*sigma*V[v]*S[s] *derSV
                C = 0.5*(sigma**2)*V[v]*derVV
                D = 0.5*(r-q)*S[s]*derS
                E = kappa*(theta-V[v])*derV
                L = dt * (A + B + C + D + E - r*u[s,v])
                U[s,v] = L + u[s,v]
      
        if t %100==0:
            print(t)
            
  
    return U

Vmax=0.5
Vmin=0
Smin=0
T=0.15

K=100.
Smax = 2*K

Vmax=0.5

r= 0.02
q=0.05
rho=-0.9
v0=0.05
sigma=0.3
kappa=1.5
theta=0.04
L=17
Nv=39

Ns=79

Nt = 3000

dt = T/Nt

Vi = np.linspace(0,Vmax,Nv)
Si = np.linspace(0,Smax,Ns)
dv = (Vmax-Vmin)/Nv    
ds = (Smax-Smin)/Ns  
  
call = heston_explicit_nonuniform(Si,Vi,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds)

data = interpolate.RectBivariateSpline(Si,Vi,call)


z_new = data(101.52,0.05412)
print('FDM: ', z_new[0,0])

Solution

  • The thing is, there is no slicing in your code. Which is what is missing.

    In numpy, you compute things on whole arrays, not elements by elements. In other words, you need to avoid for loops, that are pure python (and therefore very slow: the only reason python is quite fast when using numpy, is because most of the CPU time is spend in C code, not in python code).

    So for example, to focus on a specific and quite easy part of your code, this loop

    for s in range(1,Ns-1):
        derV = (u[s,1] - u[s,0]) / (V[1] - V[0]) 
        derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
        U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV) 
    

    should not exist (or, to be more accurate, it should exist only in numpy's internal code, in C) what you do here, is, for each s between 1 and Ns-1, you compute derV, that is a combination of sth element of 2 vectors (u[:,1] and u[:,0], so 2 columns of u) and a scalar (V[1]-V[0]).

    So, all together that is Ns-2 computation for 2 Ns-2 vectors. You can let numpy do that

    derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
    

    And now, derV is an array of the Ns-2 value you compute

    So this part of the code can be turned into

    derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
    for s in range(1,Ns-1):
        derS = (u[s+1,0] - u[s-1,0]) / (S[s+1]-S[s-1])
        U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS + kappa*theta*derV[s-1]) 
    

    So we have "vectorized" the computation of the Ns-2 derV value

    Likewise

    derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])
    

    So

    derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
    derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])
    for s in range(1,Ns-1):
        U[s,0] = u[s,0] + dt * ( -r*u[s,0] + (r-q) * S[s] * derS[s-1] + kappa*theta*derV[s-1]) 
    

    And that last iterated computation, U[s,0] is also just Ns-2 independant computations, from the Ns-2 u[s,0] values, the Ns-2 derS values, the Ns-2 derV values, the Ns-2 S values, and some scalars.

    So it can also be turned into

    derV = (u[1:-1,1] - u[1:-1,0) / (V[1]-V[0])
    derS = (u[2:,0] - u[:-2,0]) / (S[2:]-S[:-2])
    U[1:-1,0] += dt * ( -r*u[1:-1,0] + (r-q) * S[1:-1] * derS + kappa*theta*derV)
    

    If I continue this for your whole code

    import numpy as np
    from scipy import interpolate 
    
    def heston_explicit_nonuniform(S,V,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds):
        
        U = np.empty((Ns,Nv))
        U[:,:] = np.maximum(S-K, 0)[:,None]
            
     
        for t in range(Nt - 1):
            U[0,:]=0
            U[Ns-1,:] = np.maximum(Smax-K,0)
            U[:,Nv-1] = np.maximum(S-K,0)
                
    
            u=U.copy()
    
            derV=(u[1:-1,1]-u[1:-1,0])/(V[1]-V[0])
            derS=(u[2:,0]-u[:-2,0])/(S[2:]-S[:-2])
            U[1:-1,0] = u[1:-1,0] + dt*(-r*u[1:-1,0] + (r-q)*S[1:-1] * derS + kappa*theta*derV)
            
            u = U.copy()
            
            derS = 0.5*(u[2:,1:-1] - u[:-2,1:-1]) / (S[2:]-S[:-2])[:,None]
            derV = 0.5*(u[1:-1,2:]-u[1:-1,:-2]) / (V[2:]-V[:-2])[None,:]
            derSS = (u[2:,1:-1]-2*u[1:-1,1:-1]+u[:-2,1:-1]) / ((S[2:]-S[1:-1]) * (S[1:-1]-S[:-2]))[:,None]
            derVV = (u[1:-1,2:]-2*u[1:-1,1:-1]+u[1:-1,:-2]) / ((V[2:]-V[1:-1]) * (V[1:-1]-V[:-2]))[None,:]
            derSV = (u[2:,2:] + u[:-2,:-2] - u[:-2, 2:] - u[2:,:-2]) \
                   / 4 / (S[2:]-S[:-2])[:,None] / (V[2:]-V[:-2])[None,:]
            A = 0.5*V[None,1:-1] * (S[1:-1,None]**2) * derSS
            B = rho*sigma*V[None,1:-1]*S[1:-1,None] * derSV
            C = 0.5*(sigma**2)*V[None,1:-1]*derVV
            D = 0.5*(r-q)*S[1:-1, None]*derS
            E = kappa*(theta-V[None,1:-1])*derV
            L = dt * (A+B+C+D+E- r*u[1:-1,1:-1])
            U[1:-1,1:-1] += L
          
        return U
    
    Vmax=0.5
    Vmin=0
    Smin=0
    T=0.15
    
    K=100.
    Smax = 2*K
    
    Vmax=0.5
    
    r= 0.02
    q=0.05
    rho=-0.9
    v0=0.05
    sigma=0.3
    kappa=1.5
    theta=0.04
    L=17
    Nv=39
    
    Ns=79
    
    Nt = 3000
    
    dt = T/Nt
    
    Vi = np.linspace(0,Vmax,Nv)
    Si = np.linspace(0,Smax,Ns)
    dv = (Vmax-Vmin)/Nv    
    ds = (Smax-Smin)/Ns  
      
    call = heston_explicit_nonuniform(Si,Vi,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds)
    
    data = interpolate.RectBivariateSpline(Si,Vi,call)
    
    
    z_new = data(101.52,0.05412)
    print('FDM: ', z_new[0,0])
    

    It returns the exact same result (including the very last decimal place), because it is the exact same computation. Just, iterations are done inside numpy's internal code, not in python

    Note that I haven't even tried to understand the algorithm. I just translated as a I would any arrays handling. Because, all the computation I've rewritten were batches of Ns-2 or Nv-2 or (Ns-2)(Nv-2) independent computation (the order wasn't important)

    It would probably be interesting to go deeper in the algorithm itself, not just its coding. For example, I am pretty sure, one at least of the two copy() could be avoided.

    It would be harder (but probably not impossible) to also vectorize the for t loop. Because this loop is different: computation done at the tth iteration is dependent on the result of the **(t-1)**th iteration's result. Which was not the case for the for s and for v loops

    That may be possible using cumulative operator (that allows to use, in some specific cases, numpy "vectorization", that is hijacking internal numpy's for loop to replace yours, even when each iteration depends on the previous one. For example, using cumsum, cumprod. (Memory wise, that would imply to keep a big array of all values of all (t,s,v) combinations. But from your example, that would be hardly 10 millions values, which is really not a problem nowadays)

    But still, without even making any effort to adapt to the problem, by simple translation of for loops into numpy batches computation, we already gained a ×100 factor (on my slow machine, it went from precisely 100 seconds to 1 second. It is not every day that I see cases where timing ratio computation are that obvious :-) )