I am trying to parallelize or optimize the following example code with numba, but I still do not get the logic behind it. Numba's parallel diagnostics report that the parallel structure is already optimal.
from numba import njit

@njit(parallel=True)
def SwVer(ResSwc, SwNew, Ia, Ja, Ka):
    for k in Ka:
        for j in Ja:
            for i in Ia:
                if SwNew[i, j, k] < ResSwc:
                    SwNew[i, j, k] = ResSwc
                if SwNew[i, j, k] > (1 - ResSwc):
                    SwNew[i, j, k] = (1 - ResSwc)
    return SwNew
import numpy as np
Imax=100
Jmax=100
Kmax=5
Ia=np.arange(0,Imax,1)
Ja=np.arange(0,Jmax,1)
Ka=np.arange(0,Kmax,1)
SwNew=np.random.random((Imax,Jmax,Kmax))
SwNew=SwVer(0.25,SwNew,Ia,Ja,Ka)
How can I truly parallelize this function? Does loop unrolling improve execution time? Is there any other way to improve my triple loop?
Thanks
Q : "Does loop unrolling improve execution time?"
Loop unrolling is fine for avoiding looping overheads, yet it does not help with larger, multi-level loops: the flattened "length" of an A[N,N,N] soon grows, for N above ~ 1E3, 1E4, 1E5, way above just a few GB.
Q : "Is there any other way to improve my triple loop?"
Avoid passing redundant "parameters". It is expensive, and the more so the larger they get. Ia, Ja, Ka merely re-represent the natural domain indices of A[i,j,k], which are already present inside the A-instance itself, aren't they? Passing and receiving large, redundant parameters is a luxury we often prefer to avoid.
@njit(parallel=True)
def SwVer( ResSwc, SwNew, Ia, Ja, Ka ):
    #                     [:]  [:]  [:]   ... Ia, Ja, Ka just re-encode SwNew's own shape
    for k in ( Ka ):
        for j in ( Ja ):
            for i in ( Ia ):
                if ( SwNew[i,j,k] <       ResSwc   ):
                     SwNew[i,j,k] =       ResSwc
                if ( SwNew[i,j,k] > ( 1 - ResSwc ) ):
                     SwNew[i,j,k] =     ( 1 - ResSwc )
    # SwNew[:,:,:] was modified in-place
    return SwNew
The as-was code executed in about ~ 1,824,168 [us] ~ 1.8 [s].

In-place modification is always faster than collecting final results into many same-sized intermediate instances, which is another anti-pattern of high-performance grade code. Just removing the last line, return SwNew, yields ~ 714,977 [us] ~ 0.7 [s].
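To the question "How can I truly parallelize this function?": a minimal sketch, using numba's prange on the outermost axis and reading the extents from SwNew.shape rather than passing Ia, Ja, Ka at all. The function name is mine and, per the in-place advice above, it returns nothing:

from numba import njit, prange
import numpy as np

@njit( parallel=True )
def SwVerInplace( ResSwc, SwNew ):
    ni, nj, nk = SwNew.shape            # extents already live inside the array
    for i in prange( ni ):              # prange splits the outer axis across threads
        for j in range( nj ):
            for k in range( nk ):       # k innermost: contiguous for C-ordered arrays
                if SwNew[i,j,k] < ResSwc:
                   SwNew[i,j,k] = ResSwc
                if SwNew[i,j,k] > ( 1 - ResSwc ):
                   SwNew[i,j,k] = ( 1 - ResSwc )

SwNew = np.random.random( ( 100, 100, 5 ) )
SwVerInplace( 0.25, SwNew )             # SwNew clamped in-place, no copy returned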
For unique performance on trivially [i,j,k]-mappable transformations like this one, you may like to try numpy-vectorised computing.

The whole trick is this short:
np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) )
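Applied to the question's data it reads as below; the np.clip call in the comment is an equivalent built-in I am adding for reference, not the variant timed in this answer:

import numpy as np

T_lo  = 0.25                    # ResSwc
T_hi  = 1 - T_lo
SwNew = np.random.random( ( 100, 100, 5 ) )

SwNew = np.where( SwNew > T_hi, T_hi, np.where( SwNew < T_lo, T_lo, SwNew ) )
# SwNew = np.clip( SwNew, T_lo, T_hi )      # equivalent built-in clamp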
Your function can run anywhere between ~ 12,086 [us] and ~ 97,810 [us] in numpy-vectorised mode. Physical RAM, cache, CPU and O/S-workload details will cause variable effects, yet the RAM-size matters most for smart-vectorised code, and numpy is smart:

A[1000,1000,1000] ~ 8 [GB] RAM-footprint. Yet details matter. A lot.
A[ 100, 100, 100] ~ 8 [MB] RAM-footprint. Now it does fit in L3/L2 cache ... that matters ...
A[  10,  10,  10] ~ 8 [kB] RAM-footprint. Now it does fit in L1d-cache ... that matters. A lot ...
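Those footprints are plain arithmetic, assuming numpy's default float64 dtype at 8 [B] per element:

for N in ( 1000, 100, 10 ):
    print( "A[%d,%d,%d] ~ %d [B]" % ( N, N, N, N**3 * 8 ) )   # float64: 8 [B] / element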
Because of the RAM ceiling, we start here with a dimensionality reduced to 2D.
>>> from zmq import Stopwatch; aClk = Stopwatch() # a [us]-resolution clock
>>> pass; import gc; gc.disable() # ... elementary
>>> pass; import numpy as np
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N, N ) ); aClk.stop()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 861, in mtrand.RandomState.random_sample
File "mtrand.pyx", line 167, in mtrand.cont0_array
MemoryError
Nevertheless, the vectorised-code trick remains the same in principle for 1D- ... nD-tensor processing:
#_______________________________________________________________________________
# [us]
#_______________________________________________________________________________
>>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
17801992
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
184895
>>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
1585
>>> N = int( 1E1 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
44
>>> N = int( 1E2 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
465
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
48651
>>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
4954694
>>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
25549190
#_______________________________________________________________________________
# [us] SEE THE GROWING COSTS FOR ram-ALLOCATIONS & STORAGE OF RANDOM num-s
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
471956
50067
49184
42891
48897
52639
45464
48828
#_______________________________________________________________________________
# [us] SEE THE COST OF THE 1st ram-ACCESS, resp. THE WAY-LOWER COSTS OF OTHER ram-ACCESSES
#_______________________________________________________________________________
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
71044
12086
16277
28192
#_______________________________________________________________________________
# [us] SEE ALSO THE "noise" IN THE LATENCY-COSTS IN 2nd+ RE-RUN-s
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
45759
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
38362
46640
37927
>>> N = int( 1E4 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
4885472
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
35747022
#_______________________________________________________________________________
# [us] SEE THE SHIFT IN LATENCY-COSTS AS ram-i/o COSTS DOMINATE > 1 [GB]
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
2307509
50446
49464
43006
50958
54800
43418
57091
52135
46451
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
97810
20117
14480
22696
31172
14846
#_______________________________________________________________________________
# [us] SEE THE "noise" FOR 2nd+ RE-RUN-s
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
47437
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
39298
19422
#_______________________________________________________________________________
# [us] SEE THE "noise" FOR 2nd+ RE-RUN
#_______________________________________________________________________________
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
44814
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
42565
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
43120
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
38039
>>> N = int( 1E3 ); aClk.start(); A = np.random.random( ( N, N ) ); aClk.stop()
45296
>>> T_lo = 0.25; T_hi = 1 - T_lo; aClk.start(); A = np.where( A > T_hi, T_hi, np.where( A < T_lo, T_lo, A ) ); aClk.stop()
41898
#_______________________________________________________________________________
# [us] SEE THE "noise" FOR block-RE-randomised-RE-RUN-s
#_______________________________________________________________________________