I'm using shared memory to share a large numpy array (write-once, read-many) with mpi4py, utilising shared windows. I find that I can set up the shared array without problem; however, if I try to access the array on any process other than the lead process, then my memory usage spikes beyond reasonable limits. I have a simple code snippet which illustrates the application here:
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
is_leader = shared_comm.rank == 0
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)
shared_comm.Barrier()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
With the above setup, the shared array should take about 2.3GB, which is confirmed when running the code and querying it. If I submit to a queue via slurm on 4 cores on a single node, with 0.75GB per process, it runs OK only if I don't do the sum. However, if I do the sum (as shown, or using np.sum or similar), then slurm complains that the memory usage is exceeded. This does not happen if the leader rank does the sum.
With 0.75GB per process, the total memory allocated is 3GB, which gives about 0.6GB for everything else other than the shared array. This should clearly be plenty.
It seems that accessing the memory on any process other than the leader is copying the memory, which is clearly useless. Have I done something wrong?
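For what it's worth, I check where the memory lands by printing each rank's own peak RSS; a minimal stdlib-only sketch (the helper name is mine, and note that ru_maxrss units are platform-dependent: kilobytes on Linux, bytes on macOS):

```python
import resource
import sys

def report_peak_rss(label):
    # ru_maxrss is this process's peak resident set size;
    # units are kilobytes on Linux and bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(label, "peak RSS:", rss)
    sys.stdout.flush()
    return rss

rss_now = report_peak_rss("after setup")
```

Calling this on each rank before and after the sum makes it easy to see which process the extra memory is charged to.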
EDIT
I've toyed with window fencing, and using put/get as below. I still get the same behaviour. If anyone runs this and doesn't replicate the problem, that is still useful info for me :)
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")
shared_comm.Barrier()
leader_rank = 0
is_leader = shared_comm.rank == leader_rank
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
print("COMM has ", shared_comm.Get_size(), " processes")
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
shared_comm.Barrier()
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank(), " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank(), " SUCCESSFULLY filled the array ")
    print("Sum should return ", np.prod(size))
win.Fence()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank(), " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1; tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter % 10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1; win.Get(tSUM, leader_rank, counter); SUM += tSUM[0]
                #SUM = SUM + _storedZModes[i,j,k]
    print("RANK: ", shared_comm.Get_rank(), " SUCCESSFULLY queried the array ", SUM)
shared_comm.Barrier()
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
Answer
Further investigation made it clear that the issue was in slurm: a switch that effectively tells slurm to ignore shared memory in its accounting was turned off, and turning it on solved this.
A description of why this causes a problem is given in the accepted answer. Essentially, slurm was counting the shared mapping towards the resident memory of both processes.
I ran this with two MPI tasks and monitored both of them with top and pmap.
These tools suggest that _storedZModes[...] = np.ones(size) does allocate a temporary buffer filled with 1, so the memory required on the leader is indeed 2 * nbytes (resident memory is 2 * nbytes, which includes nbytes in shared memory).
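The temporary can be observed without MPI at all. A small sketch (assuming a NumPy recent enough to report its allocations to tracemalloc, which any modern version does):

```python
import tracemalloc
import numpy as np

shape = (256, 256)            # small stand-in for the real array
dst = np.zeros(shape)         # allocated before tracing starts

tracemalloc.start()
dst[...] = np.ones(shape)     # right-hand side materializes a full temporary
peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print("peak traced bytes:", peak, " array bytes:", dst.nbytes)
```

The peak traced memory is at least as large as the destination array, because np.ones(shape) builds a complete second array before the element-wise copy into dst.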
From top:
top - 15:14:54 up 43 min, 4 users, load average: 2.76, 1.46, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.5 us, 6.2 sy, 0.0 ni, 66.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 161624 free, 2324936 used, 1394464 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 1258976 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:00.39 python
6389 gilles 20 0 3477268 2.5g 1.1g D 12.3 68.1 0:02.41 python
Once this operation completes, the temporary buffer filled with 1 is freed and the memory drops to nbytes (resident memory ~= shared memory). Note that at this point, both resident and shared memory are very small on task 1.
top - 15:14:57 up 43 min, 4 users, load average: 2.69, 1.47, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.2 us, 1.3 sy, 0.0 ni, 71.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 1621860 free, 848848 used, 1410316 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2735168 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:03.39 python
6389 gilles 20 0 2002704 1.1g 1.1g S 2.0 30.5 0:02.47 python
When the sum is computed on task 1, both resident and shared memory increase to nbytes.
top - 15:18:09 up 46 min, 4 users, load average: 0.33, 1.01, 1.06
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.4 us, 2.9 sy, 0.0 ni, 88.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881024 total, 1297172 free, 854460 used, 1729392 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2729768 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6389 gilles 20 0 2002704 1.1g 1.1g S 0.0 30.6 0:02.48 python
6390 gilles 20 0 2002700 1.4g 1.4g S 0.0 38.5 2:34.42 python
At the end, top reports two processes with roughly nbytes of resident memory each, which is really a single mapping of the same nbytes of shared memory.
I do not know how SLURM measures memory consumption. If it correctly takes shared memory into account, then it should be fine (i.e. nbytes allocated). But if it ignores the shared mapping, it will consider that your job allocated 2 * nbytes of (resident) memory, and that will likely be too much.
Note that if you replace the initialization with
if is_leader:
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                _storedZModes[i,j,k] = 1
a temporary buffer filled with 1 is not allocated, and the max memory consumption on rank 0 is nbytes instead of 2 * nbytes.