Tags: python, performance, parallel-processing, numba, low-latency

How to use numba parallel?


I am trying to parallelize or optimize the following example code with numba. However, I still do not get the logic behind this.

Numba's parallel diagnostics indicate that the parallel structure is already optimal.

from numba import njit
@njit(parallel=True)
def SwVer(ResSwc,SwNew,Ia,Ja,Ka):
    for k in Ka:
        for j in Ja:
            for i in Ia:
                if(SwNew[i,j,k]<ResSwc):
                    SwNew[i,j,k]=ResSwc
                if(SwNew[i,j,k]>(1-ResSwc)):
                    SwNew[i,j,k]=(1-ResSwc)
    
    return SwNew

import numpy as np
Imax=100
Jmax=100
Kmax=5

Ia=np.arange(0,Imax,1)
Ja=np.arange(0,Jmax,1)
Ka=np.arange(0,Kmax,1)

SwNew=np.random.random((Imax,Jmax,Kmax))
SwNew=SwVer(0.25,SwNew,Ia,Ja,Ka)

How can I truly parallelize this function? Does loop unrolling improve execution time? Is there any other way to improve my triple loop?

Thanks


Solution

  • Q : "Does loop unrolling improve execution time?"

    Loop unrolling is fine for avoiding looping overheads, yet larger, multi-level loops bring more problems: the flattened "length" of an A[N,N,N] array grows as N³, so for N at or above ~ 1E3, 1E4, 1E5 the memory footprint soon grows way above just a few GB.

  • Q : "Is there any other way to improve my triple loop?"

    Avoid passing redundant "parameters". It is expensive, and the more so the larger these get. Ia, Ja, Ka just re-represent the natural domain indices of A[i,j,k], which are already implied by the A-instance's own shape, aren't they? Passing and receiving large, redundant parameters is a luxury we often prefer to avoid.

    @njit(parallel=True)
    def SwVer(ResSwc, SwNew, Ia, Ja, Ka):
        for k in Ka:
            for j in Ja:
                for i in Ia:
                    if SwNew[i,j,k] < ResSwc:
                        SwNew[i,j,k] = ResSwc
                    if SwNew[i,j,k] > (1 - ResSwc):
                        SwNew[i,j,k] = 1 - ResSwc
        return SwNew        # SwNew[:,:,:] is modified in place anyway
    

    The as-was STATE-0 :

    The as-was code ran in about ~ 1,824,168 [us] ~ 1.8 [s].


    A small STEP for a man :

    In-place modifications are always faster than collecting final results into many same-sized intermediate instances, which is another anti-pattern in high-performance-grade code.

    Just removing the last line return SwNew yields ~ 714,977 [us] ~ 0.7 [s]


    Now
    a small STEP for a Man,
    but
    a GIANT LEAP for the MANKIND ( ... The Performance ... ) :

    For top performance on trivially [i,j,k]-mappable transformations, one may like to try the brave numpy-vectorised computing.

    The whole trick is this short : np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ))
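As a side note, NumPy also ships a single-call idiom for exactly this two-sided threshold, np.clip, which produces the same clamped result as the nested np.where; a quick equivalence check:

```python
import numpy as np

A = np.random.random((1000, 1000))
T_lo, T_hi = 0.25, 1 - 0.25

nested  = np.where(A > T_hi, T_hi, np.where(A < T_lo, T_lo, A))
clipped = np.clip(A, T_lo, T_hi)   # one call, same clamped values

assert np.array_equal(nested, clipped)
```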

    Your function can run anywhere between ~ 12,086 [us] and ~ 97,810 [us] in numpy-vectorised mode. Physical RAM, cache, CPU and O/S-workload details cause variable effects, yet RAM-sizes matter most for smart-vectorised code, and numpy is smart, a lot, the most:
    A[1000,1000,1000] ~ 8 [GB] RAM-footprint. Yet details matter. A lot.
    A[ 100, 100, 100] ~ 8 [MB] RAM-footprint. This fits in L3/L2 cache ... that matters ...
    A[  10,  10,  10] ~ 8 [kB] RAM-footprint. This fits in L1d cache ... that matters. A lot ...
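These footprints follow directly from the 8 bytes each float64 element costs; a quick sanity check (the ~ 8 [GB] case is only computed, not allocated, to spare the RAM):

```python
import numpy as np

# float64 arrays cost 8 bytes per element
assert np.zeros((100, 100, 100)).nbytes == 8 * 100**3   # 8,000,000 B ~ 8 [MB]
assert np.zeros(( 10,  10,  10)).nbytes == 8 *  10**3   # 8,000     B ~ 8 [kB]
assert 8 * 1000**3 == 8_000_000_000                     # A[1000,1000,1000] ~ 8 [GB]
```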


    Let's be Quantitative from the very start :

    We start here with the dimensionality reduced to 2D, because of the RAM.

    >>> from zmq import Stopwatch;  aClk = Stopwatch()     # a [us]-resolution clock
    >>> pass;    import gc;               gc.disable()     #   ...  elementary
    >>> pass;    import numpy as np
    #_______________________________________________________________________________
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N, N ) ); aClk.stop()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "mtrand.pyx", line 861, in mtrand.RandomState.random_sample
      File "mtrand.pyx", line 167, in mtrand.cont0_array
    MemoryError
    

    Nevertheless, the vectorised code trick remains the same in principle for 1D- ... nD-tensor processing :
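For instance, the very same one-liner applied to a 3D tensor with the question's (Imax, Jmax, Kmax) shape and thresholds (a sketch):

```python
import numpy as np

T_lo = 0.25
T_hi = 1 - T_lo
A = np.random.random((100, 100, 5))    # the question's (Imax, Jmax, Kmax) shape

# same vectorised trick, now on a 3D tensor
A = np.where(A > T_hi, T_hi, np.where(A < T_lo, T_lo, A))
```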

    #_______________________________________________________________________________
    #       [us]
    #_______________________________________________________________________________
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
    17801992
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
      184895
    >>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
        1585
    >>> N = int( 1E1 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
          44
    >>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
         465
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
       48651
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
     4954694
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
    25549190
    #_______________________________________________________________________________
    #       [us] SEE THE GROWING COSTS FOR ram-ALLOCATIONS & STORAGE OF RANDOM num-s
    #_______________________________________________________________________________
    
    
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
      471956
       50067
       49184
       42891
       48897
       52639
       45464
       48828
    #_______________________________________________________________________________
    #       [us] SEE THE 1st, resp. WAY-LOWER COSTS FOR 1st, resp. OTHER ram-ACCESS
    #_______________________________________________________________________________
    
    
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       71044
       12086
       16277
       28192
    #_______________________________________________________________________________
    #       [us] SEE ALSO THE "noise" IN THE LATENCY-COSTS IN 2nd+ RE-RUN-s
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
       45759
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       38362    
       46640
       37927
    
    >>> N = int( 1E4 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
     4885472
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
    35747022
    #_______________________________________________________________________________
    #       [us] SEE THE SHIFT IN LATENCY-COSTS AS ram-i/o COSTS DOMINATE > 1 [GB]
    #_______________________________________________________________________________
    
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
     2307509
       50446
       49464
       43006
       50958
       54800
       43418
       57091
       52135
       46451
    
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       97810
       20117
       14480
       22696
       31172
       14846
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR 2nd+ RE-RUN-s
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
       47437
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       39298
       19422
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR 2nd+ RE-RUN
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       44814
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       42565
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       43120
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       38039
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       45296
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       41898
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR block-RE-randomised-RE-RUN-s
    #_______________________________________________________________________________