Tags: python, mpi, shared-memory, mpi4py

MPI4PY shared memory - memory usage spike on access


I'm using shared memory to share a large numpy array (write-once, read-many) with mpi4py, utilising shared windows. I am finding that I can set up the shared array without a problem; however, if I try to access the array on any process other than the lead process, my memory usage spikes beyond reasonable limits. Here is a simple code snippet which illustrates the application:

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)

is_leader = shared_comm.rank == 0

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0

# Create the shared memory, or get a handle based on shared communicator                                                                                                                                           
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array                                                                                                                                                                                              
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)

shared_comm.Barrier()

# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]                                                                                                                                               

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)

With the above setup, the shared array should take about 2.3GB, which is confirmed when running the code and querying it. If I submit to a queue via slurm on 4 cores on a single node, with 0.75GB per process, it runs OK only if I don't do the sum. However, if I do the sum (as shown, or using np.sum or similar), then slurm complains that the memory usage has been exceeded. This does not happen if the leader rank does the sum.

With 0.75GB per process, the total memory allocated is 3GB, which leaves about 0.6GB for everything other than the shared array. This should clearly be plenty.

It seems that accessing the memory from any process other than the leader copies the memory, which defeats the purpose of sharing it. Have I done something wrong?
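For reference, one way to see where the memory actually goes (a minimal sketch only; it assumes Linux and the third-party psutil package, neither of which is part of the snippet above) is to have each rank report how much of its resident set is shared:

# Illustrative diagnostic, not part of the original snippet.
# Requires Linux and the third-party psutil package.
import psutil

def report_memory(comm, label):
    mem = psutil.Process().memory_info()
    rss = mem.rss / 1024.**3
    shared = mem.shared / 1024.**3   # pages shared with other processes
    print("rank %d (%s): rss = %.2f GB, shared = %.2f GB, private = %.2f GB"
          % (comm.rank, label, rss, shared, rss - shared))

# e.g. call report_memory(shared_comm, "after sum") around the suspect access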

EDIT

I've toyed with window fencing and with using Put/Get, as below, and I still get the same behaviour. If anyone runs this and doesn't replicate the problem, that is still useful information for me :)

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")

shared_comm.Barrier()

leader_rank = 0
is_leader = shared_comm.rank == leader_rank

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

print("COMM has ", shared_comm.Get_size(), " processes")

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0

# Create the shared memory, or get a handle based on shared communicator                                                                  

shared_comm.Barrier()                      
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array                                                                                                                     

buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank() , " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY filled the array ")
    print("Sum should return ", np.prod(size))
win.Fence()

# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank() , " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1; tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter%10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1; win.Get(tSUM, leader_rank, counter); SUM += tSUM[0];
                #SUM = SUM + _storedZModes[i,j,k]                                                                                         

    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY queried the array ", SUM)

shared_comm.Barrier()

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)

Answer

Further investigation made it clear that the issue was in slurm: a configuration switch that effectively tells slurm to ignore shared memory in its accounting was turned off, and turning it on solved the problem.

A description of why this causes a problem is given in the accepted answer below. Essentially, slurm was counting the full resident memory of every process, so the shared window was charged once per rank that touched it.
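For illustration only, the kind of switch involved lives in slurm's job accounting configuration; the exact parameter names depend on the slurm version and the accounting plugin in use, so treat the following slurm.conf fragment as an assumption to check against your installation's documentation rather than a recipe:

# slurm.conf (illustrative fragment, not taken from the cluster in question)
JobAcctGatherType=jobacct_gather/linux
# UsePss charges each process only its proportional share of shared pages;
# NoShared would exclude shared memory from the accounting altogether.
JobAcctGatherParams=UsePss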


Solution

  • I ran this with two MPI tasks and monitored both of them with top and pmap.

    These tools suggest that

    _storedZModes[...] = np.ones(size)
    

    does allocate a temporary buffer filled with 1s, so the memory required on the leader is indeed 2 * nbytes (resident memory is 2 * nbytes, of which nbytes is the shared memory window).

    From top

    top - 15:14:54 up 43 min,  4 users,  load average: 2.76, 1.46, 1.18
    Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 27.5 us,  6.2 sy,  0.0 ni, 66.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem :  3881024 total,   161624 free,  2324936 used,  1394464 buff/cache
    KiB Swap:   839676 total,   818172 free,    21504 used.  1258976 avail Mem 
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:00.39 python
     6389 gilles    20   0 3477268   2.5g   1.1g D  12.3 68.1   0:02.41 python
    

    Once this operation completes, the temporary buffer filled with 1s is freed and the memory drops to nbytes (resident memory ~= shared memory).

    Note that at this point, both resident and shared memory are still very small on task 1.

    top - 15:14:57 up 43 min,  4 users,  load average: 2.69, 1.47, 1.18
    Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 27.2 us,  1.3 sy,  0.0 ni, 71.3 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem :  3881024 total,  1621860 free,   848848 used,  1410316 buff/cache
    KiB Swap:   839676 total,   818172 free,    21504 used.  2735168 avail Mem 
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:03.39 python
     6389 gilles    20   0 2002704   1.1g   1.1g S   2.0 30.5   0:02.47 python
    

    When the sum is computed on task 1, both its resident and shared memory increase to roughly nbytes, since reading the array faults the shared pages into that process's address space as well.

    top - 15:18:09 up 46 min,  4 users,  load average: 0.33, 1.01, 1.06
    Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  8.4 us,  2.9 sy,  0.0 ni, 88.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  3881024 total,  1297172 free,   854460 used,  1729392 buff/cache
    KiB Swap:   839676 total,   818172 free,    21504 used.  2729768 avail Mem 
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     6389 gilles    20   0 2002704   1.1g   1.1g S   0.0 30.6   0:02.48 python
     6390 gilles    20   0 2002700   1.4g   1.4g S   0.0 38.5   2:34.42 python
    

    At the end, top reports two processes each with roughly nbytes of resident memory, but this is the same single nbytes mapping of shared memory being counted once per process.

    I do not know how SLURM measures memory consumption. If it correctly takes the sharing into account, then it should be fine (i.e. nbytes allocated in total). But if it ignores it, it will consider that your job has allocated 2 * nbytes of (resident) memory, and that will likely be too much.

    Note that if you replace the initialization with

    if is_leader:
        for i in range(_nModes):
            for j in range(_nSamples):
                for k in range(_nSamples):
                    _storedZModes[i,j,k] = 1
    

    then the temporary buffer filled with 1s is never allocated, and the maximum memory consumption on rank 0 is nbytes instead of 2 * nbytes.
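    A lighter-weight alternative (a sketch relying only on plain numpy broadcasting, not on anything mpi4py-specific) is to assign a scalar, which writes into the shared buffer in place without building a second nbytes-sized array:

    if is_leader:
        # Scalar broadcast (or, equivalently, _storedZModes.fill(1.0)) writes
        # directly into the shared window without allocating a temporary
        # np.ones(size) array first.
        _storedZModes[...] = 1.0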