
Troubleshooting MPI4py Error When Using Numba-Accelerated Python Code


I'm trying to accelerate a piece of Python code with Numba. Within this code I also use mpi4py for parallelization, but I've run into an error. A minimal example that reproduces the error is below:

import numpy as np
from numba import njit, prange
from mpi4py import MPI

#####MPI setting#####
comm=MPI.COMM_WORLD
rank=comm.Get_rank()
size=comm.Get_size()
N_theta_to_scan=1000
n_per_proc=int(N_theta_to_scan/size)
n_more=int(N_theta_to_scan%size)
if rank<n_more:
    start=rank*(n_per_proc+1)
    number_to_cal=n_per_proc+1
else:
    start=n_more*(n_per_proc+1)+(rank-n_more)*n_per_proc
    number_to_cal=n_per_proc

#####function#####
@njit(parallel=True)
def func1(na,nb,nc):
    to_sum=np.zeros(na*nb*nc)
    for a in prange(0,na):
        for b in prange(0,nb):
            for c in prange(0,nc):
                to_sum[a*nb*nc+b*nc+c]=a*b*c
    out=np.sum(to_sum)
    return out

@njit(parallel=True)
def func2(start,number_to_cal):
    to_sum=np.zeros(number_to_cal)
    for i in prange(start,start+number_to_cal):
        to_sum[i-start]=func1(i,i*i,i*i*i)
    out2=np.sum(to_sum)
    return out2

#####main section#####
to_be_gather=np.array([func2(start,number_to_cal)])
gatheres=np.zeros(0)
comm.Reduce(to_be_gather,gatheres,op=MPI.SUM)

Running this code produces the following error:

MemoryError: Allocation failed (probably too large).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/share/workspace/wuliang/hanlin/test/test.py", line 38, in <module>
    to_be_gather=func2(start,number_to_cal)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: CPUDispatcher(<function func2 at 0x2aed19d7b420>) returned a result with an exception set

Next, I removed func1() and modified func2() by replacing to_sum[i-start]=func1(i,i*i,i*i*i) with to_sum[i-start]=i, and the code then ran successfully. I also tried running the main section on rank 0 only (i.e., under if rank==0:), which likewise ran successfully. Where does the error lie, and what should I change so that the code does what it is intended to do?
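For completeness, the modified func2() described above (with the func1() call replaced by i) is:

@njit(parallel=True)
def func2(start,number_to_cal):
    to_sum=np.zeros(number_to_cal)
    for i in prange(start,start+number_to_cal):
        to_sum[i-start]=i    # func1(i,i*i,i*i*i) replaced with i -- this version runs without error
    out2=np.sum(to_sum)
    return out2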


Solution

  • Regarding the code in the original question and the comments about coupling Numba and MPI: calling MPI from within Numba-compiled code (including with parallel=True) is possible with the numba-mpi package.

    Here's a compute-pi-in-parallel example from the package README, in which the reduction is performed inside a @numba.jit-decorated function (with mpi4py, the reduction has to be done outside the JIT-compiled code, as in the question):

    import numba, numpy as np, numba_mpi
    
    N_TIMES = 10000
    
    @numba.jit
    def get_pi_part(n_intervals=1000000, rank=0, size=1):
        h = 1 / n_intervals
        partial_sum = 0.0
        for i in range(rank + 1, n_intervals, size):
            x = h * (i - 0.5)
            partial_sum += 4 / (1 + x**2)
        return h * partial_sum
    
    @numba.jit
    def pi_numba_mpi(n_intervals):
        pi = np.array([0.])
        part = np.empty_like(pi)
        for _ in range(N_TIMES):
            part[0] = get_pi_part(n_intervals, numba_mpi.rank(), numba_mpi.size())
            numba_mpi.allreduce(part, pi, numba_mpi.Operator.SUM)
    

    The package README includes a performance comparison with mpi4py.
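
  • As a rough sketch (not part of the original answer) of how this could apply to the code in the question: the per-rank partial sum and the reduction can both live inside the JIT-compiled function by calling numba_mpi.allreduce there, instead of calling comm.Reduce outside of it. The per-index work below is only a placeholder for func1() (the MemoryError itself appears to come from the to_sum array allocated in func1(), whose length na*nb*nc grows roughly like i**6 and quickly exceeds available memory, so func1() would need rethinking regardless of how the reduction is done):

    import numpy as np
    from numba import njit, prange
    import numba_mpi

    N_theta_to_scan = 1000                      # total number of work items, as in the question
    rank, size = numba_mpi.rank(), numba_mpi.size()
    n_per_proc, n_more = divmod(N_theta_to_scan, size)
    if rank < n_more:
        start = rank * (n_per_proc + 1)
        number_to_cal = n_per_proc + 1
    else:
        start = n_more * (n_per_proc + 1) + (rank - n_more) * n_per_proc
        number_to_cal = n_per_proc

    @njit(parallel=True)
    def func2(start, number_to_cal):
        to_sum = np.zeros(number_to_cal)
        for i in prange(start, start + number_to_cal):
            to_sum[i - start] = float(i)        # placeholder for the real per-index work
        partial = np.zeros(1)
        partial[0] = np.sum(to_sum)             # this rank's partial result
        total = np.zeros(1)
        # reduction over all ranks, performed inside the JIT-compiled function
        numba_mpi.allreduce(partial, total, numba_mpi.Operator.SUM)
        return total[0]

    result = func2(start, number_to_cal)        # every rank now holds the full sum

    Like the mpi4py version, this would be launched with something along the lines of mpiexec -n 4 python script.py (the script name here is arbitrary).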