Tags: python, performance, parallel-processing, numba, low-latency

How to use numba parallel?


I am trying to parallelize or optimize the following example code with numba. However, I still do not get the logic behind this.

Numba's parallel diagnostics indicate that the parallel structure is already optimal.

from numba import njit
@njit(parallel=True)
def SwVer(ResSwc,SwNew,Ia,Ja,Ka):
    for k in Ka:
        for j in Ja:
            for i in Ia:
                if(SwNew[i,j,k]<ResSwc):
                    SwNew[i,j,k]=ResSwc
                if(SwNew[i,j,k]>(1-ResSwc)):
                    SwNew[i,j,k]=(1-ResSwc)
    
    return SwNew

import numpy as np
Imax=100
Jmax=100
Kmax=5

Ia=np.arange(0,Imax,1)
Ja=np.arange(0,Jmax,1)
Ka=np.arange(0,Kmax,1)

SwNew=np.random.random((Imax,Jmax,Kmax))
SwNew=SwVer(0.25,SwNew,Ia,Ja,Ka)

How can I truly parallelize this function? Does loop unrolling improve execution time? Is there any other way to improve my triple loop?

Thanks


Solution

  • Q : "Does loop unrolling improve execution time?"

    Loop unrolling is fine for avoiding looping overheads, yet larger, multi-level loops bring more problems: the flattened "length" of an A[N,N,N] array grows as N³, so for N at or above ~ 1E3, 1E4, 1E5 the memory footprint soon grows way above just a few GB.

  • Q : "Is there any other way to improve my triple loop?"

    Avoid passing redundant "parameters". It is expensive, and the more so the larger these get. Ia, Ja, Ka just re-represent the natural domain indices of A[i,j,k], which are already implied by the A-instance's own shape, aren't they? Passing and receiving large, redundant parameters is a luxury we often prefer to avoid.

    @njit(parallel=True)
    def SwVer(ResSwc, SwNew, Ia, Ja, Ka):
        for k in Ka:
            for j in Ja:
                for i in Ia:
                    if SwNew[i,j,k] < ResSwc:
                        SwNew[i,j,k] = ResSwc
                    if SwNew[i,j,k] > (1 - ResSwc):
                        SwNew[i,j,k] = 1 - ResSwc
        return SwNew        # SwNew[:,:,:] is modified in place anyway
    

    The as-was STATE-0 :

    The as-was code ran in about ~ 1,824,168 [us] ~ 1.8 [s].


    A small STEP for a man :

    In-place modifications are always faster than collecting final results into many same-sized intermediate instances, which is another anti-pattern in high-performance-grade code.

    Just removing the last line return SwNew yields ~ 714,977 [us] ~ 0.7 [s]


    Now
    a small STEP for a Man,
    but
    a GIANT LEAP for the MANKIND ( ... The Performance ... ) :

    For top performance on trivially [i,j,k]-mappable transformations, one may like to try the brave numpy-vectorised computing.

    The whole trick is this short : np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ))
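As a side note, NumPy also ships a single-call idiom for exactly this two-sided threshold, np.clip, which produces the same clamped result as the nested np.where; a quick equivalence check:

```python
import numpy as np

A = np.random.random((1000, 1000))
T_lo, T_hi = 0.25, 1 - 0.25

nested  = np.where(A > T_hi, T_hi, np.where(A < T_lo, T_lo, A))
clipped = np.clip(A, T_lo, T_hi)   # one call, same clamped values

assert np.array_equal(nested, clipped)
```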

    Your function can run anywhere between ~ 12,086 [us] and ~ 97,810 [us] in numpy-vectorised mode. Physical RAM, cache, CPU and O/S-workload details cause variable effects, yet RAM-sizes matter most for smart-vectorised code, and numpy is smart, a lot, the most:
    A[1000,1000,1000] ~ 8 [GB] RAM-footprint. Yet details matter. A lot.
    A[ 100, 100, 100] ~ 8 [MB] RAM-footprint. This fits in L3/L2 cache ... that matters ...
    A[  10,  10,  10] ~ 8 [kB] RAM-footprint. This fits in L1d cache ... that matters. A lot ...
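These footprints follow directly from the 8 bytes each float64 element costs; a quick sanity check (the ~ 8 [GB] case is only computed, not allocated, to spare the RAM):

```python
import numpy as np

# float64 arrays cost 8 bytes per element
assert np.zeros((100, 100, 100)).nbytes == 8 * 100**3   # 8,000,000 B ~ 8 [MB]
assert np.zeros(( 10,  10,  10)).nbytes == 8 *  10**3   # 8,000     B ~ 8 [kB]
assert 8 * 1000**3 == 8_000_000_000                     # A[1000,1000,1000] ~ 8 [GB]
```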


    Let's be Quantitative from the very start :

    We start here with the dimensionality reduced to 2D, because of the RAM.

    >>> from zmq import Stopwatch;  aClk = Stopwatch()     # a [us]-resolution clock
    >>> pass;    import gc;               gc.disable()     #   ...  elementary
    >>> pass;    import numpy as np
    #_______________________________________________________________________________
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N, N ) ); aClk.stop()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "mtrand.pyx", line 861, in mtrand.RandomState.random_sample
      File "mtrand.pyx", line 167, in mtrand.cont0_array
    MemoryError
    

    Nevertheless, the vectorised code trick remains the same in principle for 1D- ... nD-tensor processing :
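For instance, the very same one-liner applied to a 3D tensor with the question's (Imax, Jmax, Kmax) shape and thresholds (a sketch):

```python
import numpy as np

T_lo = 0.25
T_hi = 1 - T_lo
A = np.random.random((100, 100, 5))    # the question's (Imax, Jmax, Kmax) shape

# same vectorised trick, now on a 3D tensor
A = np.where(A > T_hi, T_hi, np.where(A < T_lo, T_lo, A))
```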

    #_______________________________________________________________________________
    #       [us]
    #_______________________________________________________________________________
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
    17801992
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
      184895
    >>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
        1585
    >>> N = int( 1E1 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
          44
    >>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
         465
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
       48651
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
     4954694
    >>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
    25549190
    #_______________________________________________________________________________
    #       [us] SEE THE GROWING COSTS FOR ram-ALLOCATIONS & STORAGE OF RANDOM num-s
    #_______________________________________________________________________________
    
    
    >>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
      471956
       50067
       49184
       42891
       48897
       52639
       45464
       48828
    #_______________________________________________________________________________
    #       [us] SEE THE 1st, resp. WAY-LOWER COSTS FOR 1st, resp. OTHER ram-ACCESS
    #_______________________________________________________________________________
    
    
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       71044
       12086
       16277
       28192
    #_______________________________________________________________________________
    #       [us] SEE ALSO THE "noise" IN THE LATENCY-COSTS IN 2nd+ RE-RUN-s
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
       45759
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       38362    
       46640
       37927
    
    >>> N = int( 1E4 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
     4885472
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
    35747022
    #_______________________________________________________________________________
    #       [us] SEE THE SHIFT IN LATENCY-COSTS AS ram-i/o COSTS DOMINATE > 1 [GB]
    #_______________________________________________________________________________
    
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
     2307509
       50446
       49464
       43006
       50958
       54800
       43418
       57091
       52135
       46451
    
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       97810
       20117
       14480
       22696
       31172
       14846
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR 2nd+ RE-RUN-s
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 );               aClk.start(); A = np.random.random( ( N, N ) );                              aClk.stop()
       47437
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       39298
       19422
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR 2nd+ RE-RUN
    #_______________________________________________________________________________
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       44814
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       42565
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       43120
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       38039
    
    >>> N = int( 1E3 ); aClk.start();               A = np.random.random( ( N, N ) );                              aClk.stop()
       45296
    >>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
       41898
    #_______________________________________________________________________________
    #       [us] SEE THE "noise" FOR block-RE-randomised-RE-RUN-s
    #_______________________________________________________________________________