The script below writes in parallel to a memory-mapped array with mmap. However, it only works when all processes are on the same node; otherwise it produces rows of 0 for processes not on the rank-0 node, or other stray zeros in the output. Why is this? I feel I am missing something about how mmap works.
Edit: The same result occurs both on an NFS filesystem and on a parallel distributed filesystem. A commenter below suggested it has to do with the page size used by mmap. When the 'length' of my slice is exactly 4 KiB the script still produces the wrong output, and the same happens when the slices are much longer than 4 KiB.
#!/usr/bin/python3
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
length = int(1e6) # Edited to make test case longer.
myfile = "/tmp/map"
if rank == 0:
    fp = np.memmap(myfile, dtype=np.float32, mode='w+', shape=(size,length))
    del fp
comm.Barrier()
fp = np.memmap(myfile, dtype=np.float32, mode='r+', shape=(1,length),
               offset=rank*length*4)
fp[:,:] = np.full(length,rank)
comm.Barrier()
if rank == 0:
    out = np.memmap(myfile, dtype=np.float32, mode='r', shape=(size,length))
    print(out[:,:])
Correct output:
[[ 0. 0. 0. 0.]
[ 1. 1. 1. 1.]
[ 2. 2. 2. 2.]
[ 3. 3. 3. 3.]
[ 4. 4. 4. 4.]]
Incorrect output: the rows for ranks 3 and 4 were not written.
[[ 0. 0. 0. 0.]
[ 1. 1. 1. 1.]
[ 2. 2. 2. 2.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
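For reference, the 4 KiB case in the edit corresponds to a slice of 1024 float32 values; a quick sanity check of the page size and per-rank offsets (a sketch, assuming 4096-byte pages) looks like this:
import mmap
itemsize = 4                          # bytes per np.float32
length = mmap.PAGESIZE // itemsize    # 1024 elements = exactly one 4 KiB page
for rank in range(5):
    offset = rank * length * itemsize
    print(rank, offset, offset % mmap.PAGESIZE == 0)   # offsets are page-aligned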
This answer applies to NFS files. YMMV on other networked filesystems.
The problem is not related to MPI or numpy.memmap, but to how the Linux kernel caches NFS file data. As far as I can tell from some experimentation, before requesting a read from the NFS server, the client first asks for the file's last-modified timestamp. If that timestamp is not more recent than the client's own last write, the data is taken from the client's cache rather than requested from the server again. If N1 and N2 are nodes, the following may happen:
1. N1 creates the file and writes its rows; the NFS client on N1 caches those pages and records the file's modification time.
2. N2 writes its rows; the file's timestamp on the server advances, but possibly not beyond the one-second granularity of N1's last write.
3. N1 reads the whole file back. Because the server's timestamp does not look newer than N1's own last write, N1 serves the read from its cache, in which N2's rows are still zeros.
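One way to observe this (a minimal diagnostic sketch, not part of the fix itself; the path and barrier placement are assumptions matching the code below) is to print the modification time each rank sees after all writes have been flushed:
import os
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
myfile = "/tmp/map"
comm.Barrier()         # after every rank has written and flushed its slice
st = os.stat(myfile)   # may return cached attributes on a stale client
print("rank %d: mtime=%.6f size=%d" % (rank, st.st_mtime, st.st_size))
A client that is still working from its attribute cache may report an older mtime than the node that wrote last.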
Therefore, you need to ensure that other clients (nodes) update the timestamp.
if rank == 0:
    fp = np.memmap(myfile, dtype=np.float32, mode='w+', shape=(size,length))
    del fp
comm.Barrier() # B0
fp = np.memmap(myfile, dtype=np.float32, mode='r+', shape=(1,length),
               offset=rank*length*4)
comm.Barrier() # B1 (this one may be superfluous)
fp[:,:] = np.full(length, rank)
del fp # this will flush the changes to storage
comm.Barrier() # B2
from pathlib import Path
from time import sleep
if rank == 1:
    # make sure that another node updates the timestamp
    # (assuming 1 s granularity on NFS timestamps)
    sleep(1.01)
    Path(myfile).touch()
    sleep(0.1) # not sure
comm.Barrier() # B3
if rank == 0:
    out = np.memmap(myfile, dtype=np.float32, mode='r', shape=(size,length))
    print(out[:,:])
About the barrier B1: I don't have MPI set up here; I simulated it with keypresses, so I'm not sure that this barrier is really necessary. The sleep(0.1) may also not be necessary; it's just there in case of any latency between the touch() call returning and the NFS server receiving the update.
I have assumed that you arranged the data such that each node accesses parts of the memory-mapped file aligned to 4096-byte boundaries. I tested with length=4096.
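A quick sketch of that alignment check (mmap.ALLOCATIONGRANULARITY is the relevant granularity, 4096 bytes on typical Linux systems; the values of size and length mirror the test above):
import mmap
size, length, itemsize = 5, 4096, 4   # assumed values from the test above
for rank in range(size):
    offset = rank * length * itemsize
    assert offset % mmap.ALLOCATIONGRANULARITY == 0, (rank, offset)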
This solution is a bit of a hack, relying on undocumented behavior of the NFS driver. This was on Linux kernel 3.10.0-957 with NFS mount options including relatime,vers=3,rsize=8192,wsize=8192. If you use this approach, I recommend that you include a self-test: basically, the above code with an assert statement to verify the output. That way, you will catch it if it stops working due to a different filesystem.
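For example, a sketch of such a self-test (the expected array is derived from the "correct output" in the question; the variable names reuse the code above):
if rank == 0:
    out = np.memmap(myfile, dtype=np.float32, mode='r', shape=(size,length))
    expected = np.repeat(np.arange(size, dtype=np.float32), length).reshape(size, length)
    assert np.array_equal(out, expected), "memmap self-test failed: stale NFS cache?"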