I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx, filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    for i in range(35):
        for j in range(25):
            for k in range(20):
                dir_flux = flux[:, i, j, k]
                omni_flux = np.trapz(dir_flux * np.sin(PA), PA)
                data[idx, i, j, k] = omni_flux  # 'data' will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba to speed up this code by a large margin. Numba is a JIT compiler able to compile NumPy-based code to fast native code, so loops are not an issue with it; in fact, using explicit loops is a good idea in Numba.
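As a minimal illustration (a toy example, not part of the solution below), decorating a plain Python loop with nb.njit compiles it to native code on the first call:

import numba as nb

@nb.njit
def dot(a, b):
    # Plain Python loop over NumPy arrays, compiled to native code by Numba
    s = 0.0
    for i in range(a.size):
        s += a[i] * b[i]
    return s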
The first thing to do is to pre-compute np.sin(PA) once to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed with a for loop and the result stored in a pre-allocated array, so as not to perform millions of expensive small array allocations. The outer loop can be executed by multiple threads using prange and the Numba flag parallel=True. It can be further accelerated with the flag fastmath=True, assuming the input values are not special (like NaN, Inf, or very small values: see subnormal numbers).
While this should theoretically be enough to get fast code, the current implementation of np.trapz is not efficient because it performs expensive allocations. One can easily rewrite the function so that it does not allocate any additional arrays.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
    # Allocation-free trapezoidal rule (equivalent to np.trapz on 1D inputs)
    s = 0.0
    for i in range(x.size-1):
        dx = x[i+1] - x[i]
        dy = y[i] + y[i+1]
        s += dx * dy
    return s * 0.5

@nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    flattenPA = np.ascontiguousarray(PA)
    sinPA = np.sin(flattenPA)
    # Parallel loop over the first output dimension
    for i in nb.prange(si):
        tmp = np.empty(sl)  # pre-allocated scratch buffer, reused for all (j, k) pairs
        for j in range(sj):
            for k in range(sk):
                dir_flux = flux[:, i, j, k]
                for l in range(sl):
                    tmp[l] = dir_flux[l] * sinPA[l]
                omni_flux = trapz(tmp, flattenPA)
                data[i, j, k] = omni_flux
    return data

for idx, filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    data[idx] = compute(flux, PA)
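For completeness, here is one way the surrounding driver could look (a sketch only: it assumes the files are read with the netCDF4 package and that the output array is pre-allocated; adapt it to your own reader):

import numpy as np
import netCDF4  # assumed reader; any NetCDF library returning NumPy arrays works

# Pre-allocate the output array described in the question: (20160, 35, 25, 20)
data = np.empty((len(filenames), 35, 25, 20))

for idx, filename in enumerate(filenames):
    with netCDF4.Dataset(filename) as raw_data:
        # Cast to float64 so the arrays match the compiled Numba signatures
        flux = np.asarray(raw_data['FluxHydrogen'][:], dtype=np.float64)
        PA = np.asarray(raw_data['PitchAngleGrid'][:], dtype=np.float64)
        data[idx] = compute(flux, PA)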
Note that flux and PA must be NumPy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise, a pair-wise summation or even a (paranoid) Kahan summation should help (note that NumPy uses a pair-wise summation). In practice, results are the same on random normal numbers.
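For reference, a compensated (Kahan) variant of the trapz function above could look like the following sketch; it trades a little speed for reduced rounding error, and it should not be combined with fastmath=True, which may optimize the compensation away:

@nb.njit('(float64[::1], float64[::1])')
def trapz_kahan(y, x):
    # Kahan summation: 'c' carries the low-order bits lost by each addition
    s = 0.0
    c = 0.0
    for i in range(x.size-1):
        term = (x[i+1] - x[i]) * (y[i] + y[i+1])
        t = term - c
        tmp = s + t
        c = (tmp - s) - t
        s = tmp
    return s * 0.5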
The code can be made even faster by making the flux accesses more contiguous. An efficient transposition can be used to do that (NumPy's is not efficient), but this is not simple to do on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
@nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    sinPA = np.sin(PA)
    premultPA = PA * 0.5
    for i in nb.prange(si):
        for j in range(sj):
            dir_flux = flux[:, i, j, :]
            data[i, j, :].fill(0.0)
            # Accumulate trapezoid contributions along contiguous k-lines
            for l in range(sl-1):
                dx = premultPA[l+1] - premultPA[l]
                fact1 = dx * sinPA[l]
                fact2 = dx * sinPA[l+1]
                for k in range(sk):
                    data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
    return data
Note that the premultiplication makes the computation slightly less precise.
Here are results on random numbers (like @DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
DominikStańczak sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution with fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) version of @DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
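If you want to reproduce the timings on your own machine, a simple harness along these lines (hypothetical random inputs with the shapes from the question, timed with timeit) should give comparable numbers:

import timeit

flux = np.random.rand(72, 35, 25, 20)
PA = np.sort(np.random.rand(72))   # monotonically increasing grid

compute(flux, PA)                  # warm-up call triggers the JIT compilation
t = timeit.repeat(lambda: compute(flux, PA), number=10, repeat=5)
print(f'{min(t) / 10 * 1e3:.2f} ms per call')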