Tags: numpy, parallel-processing, mmap

mmap on multiple nodes


The script below writes in parallel to a memory-mapped array via mmap (numpy.memmap). However, it only works when all processes are on the same node; otherwise it produces rows of zeros for the processes that are not on the rank-0 node, or other stray zeros in the output. Why is this? I feel I am missing something about how mmap works.

Edit: The same result occurs both on an NFS filesystem and on a parallel distributed filesystem. A commenter below suggested it has to do with mmap's page size. The script still produces the wrong output when the byte length of each rank's slice is exactly 4 KiB, and also when the slices are much longer than 4 KiB.

#!/usr/bin/python3

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

length = int(1e6)      # Edited to make test case longer.
myfile = "/tmp/map"

if rank == 0:
    # Rank 0 creates the zero-filled file; the other ranks wait at the barrier.
    fp = np.memmap(myfile, dtype=np.float32, mode='w+', shape=(size,length))
    del fp

comm.Barrier()

# Each rank maps only its own row (note the byte offset) and fills it with its rank.
fp = np.memmap(myfile, dtype=np.float32, mode='r+', shape=(1,length),
                offset=rank*length*4)
fp[:,:] = np.full(length, rank)

comm.Barrier()

if rank == 0:
    out = np.memmap(myfile, dtype=np.float32, mode='r', shape=(size,length))
    print(out[:,:])

Correct output (shown here for 5 ranks and a short length, for readability):

[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 3.  3.  3.  3.]
 [ 4.  4.  4.  4.]]

Incorrect output: the writes from ranks 3 and 4 do not show up.

[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]

Solution

  • This answer applies to NFS files. YMMV on other networked filesystems.

    The problem is not related to MPI or numpy.memmap, but to how the Linux kernel caches NFS file data. As far as I can tell from some experimentation, before serving a read, the NFS client first requests the file's last-modified timestamp from the server. If this timestamp is not more recent than the client's own last write, the data is taken from the client's cache rather than requested from the server again. If N1 and N2 are nodes, the following may happen:

    1. N1 and N2 open the same zero-filled file; file content [00], last-modified: t=0.00.
    2. The N1 and N2 kernels read more of the file content than needed and store it in their caches. N1 cache: [00] (t=0.00); N2 cache: [00] (t=0.00).
    3. At time t=0.01, N2 writes to the second half of the file. Server state: [02] (t=0.01); N1 cache: [00] (0.00); N2 cache: [02] (0.01).
    4. At time t=0.02, N1 writes to the first half. Server: [12] (0.02); N1 cache: [10] (0.02); N2 cache: [02] (0.01).
    5. The process on N1 tries to read the second half.
    6. The N1 kernel requests the last-modified time from the server; result: t=0.02.
    7. Since t=0.02 is not newer than the timestamp of N1's own cache (also 0.02, from its write in step 4), the N1 kernel considers its cache valid and returns the stale 0 for the second half instead of the 2 held by the server.
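
    The decision in steps 6 and 7 can be sketched as a toy model in Python (this only illustrates the revalidation logic described above; it is not the actual kernel code):

    def client_read(cache_data, cache_mtime, server_data, server_mtime):
        # Modeled on the behavior described above: the client refetches only
        # when the server's timestamp is strictly newer than its cached one.
        if server_mtime > cache_mtime:
            return server_data   # cache invalidated, fresh data from the server
        return cache_data        # cache still considered valid

    # Step 7: N1's cache is [1, 0] stamped t=0.02 (its own write in step 4),
    # while the server holds [1, 2], also stamped t=0.02.
    print(client_read([1, 0], 0.02, [1, 2], 0.02))   # -> [1, 0]: the stale 0 wins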

    Therefore, you need to ensure that some other client (node) updates the file's timestamp after the writes, so that the reading node stops trusting its own cache:

    if rank == 0:
        fp = np.memmap(myfile, dtype=np.float32, mode='w+', shape=(size,length))
        del fp
    
    comm.Barrier() # B0
    fp = np.memmap(myfile, dtype=np.float32, mode='r+', shape=(1,length),
                    offset=rank*length*4)
    
    comm.Barrier() # B1 (this one may be superfluous)
    fp[:,:] = np.full(length, rank)
    del fp # this will flush the changes to storage
    
    comm.Barrier() # B2
    from pathlib import Path
    from time import sleep
    if rank == 1:
        # make sure that another node updates the timestamp
        # (assuming 1 s granularity on NFS timestamps)
        sleep(1.01)
        Path(myfile).touch()
        sleep(0.1) # not sure
    
    comm.Barrier() # B3
    if rank == 0:
        out = np.memmap(myfile, dtype=np.float32, mode='r', shape=(size,length))
        print(out[:,:])
    

    About the barrier B1: I don't have MPI set up here; I simulated it with keypresses. I'm not sure that this barrier is really necessary. The sleep(0.1) may also not be necessary; it's just there in case of any latency between the touch() function returning and the NFS server receiving the update.

    I have assumed that you arranged the data such that each node accesses parts of the memory-mapped file aligned to 4096-byte boundaries. I tested with length=4096.
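
    A quick way to verify that alignment assumption is to check each rank's byte offset against the page size; a minimal sketch (mmap.PAGESIZE is 4096 on typical Linux systems):

    import mmap

    itemsize = 4            # bytes per np.float32 element
    length = 4096           # elements per rank, as in the test above
    for rank in range(5):
        offset = rank * length * itemsize
        # each rank's starting byte offset should land on a page boundary
        assert offset % mmap.PAGESIZE == 0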

    This solution is a bit of a hack, relying on undocumented behavior of the NFS driver. It worked for me on Linux kernel 3.10.0-957 with NFS mount options including relatime,vers=3,rsize=8192,wsize=8192. If you use this approach, I recommend including a self-test: basically the code above plus an assert statement that verifies the output. That way you will catch it if the approach stops working on a different filesystem or kernel.
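
    Such a self-test can be as simple as asserting the result on rank 0 after the read in the last code block; a sketch, where out is the array read back above:

    if rank == 0:
        # Row i must contain the value i everywhere; a row of zeros means the
        # caching workaround has stopped working on this system.
        expected = np.arange(size, dtype=np.float32).repeat(length).reshape(size, length)
        assert np.array_equal(out[:, :], expected), "stale NFS cache detected"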