Tags: python, numpy, parallel-processing, shared-memory, python-multiprocessing

Writing to Shared Memory Making Copies?


I'm attempting to parallelize a program that reads chunked numpy arrays over the network, using shared memory. It seems to work (my data comes out the other side), but the reported memory of each child process blows up to roughly the size of the shared memory (~100-250 MB each), and it happens when I write to it. Is there some way to avoid these copies being created? They seem unnecessary, since the data propagates back to the actual shared memory array.

Here's how I've set up my array using posix_ipc, mmap, and numpy (np):

import mmap

import numpy as np
import posix_ipc

shared = posix_ipc.SharedMemory(vol.uuid, flags=posix_ipc.O_CREAT, size=int(nbytes))
array_like = mmap.mmap(shared.fd, shared.size)
renderbuffer = np.ndarray(buffer=array_like, dtype=vol.dtype, shape=mcshape)

The memory increases when I do this:

renderbuffer[ startx:endx, starty:endy, startz:endz, : ] = 1

Thanks for your help!


Solution

  • Your actual data has 4 dimensions, but I'm going to work through a simpler 2D example.

    Imagine you have this array (renderbuffer):

      1   2   3   4   5
      6   7   8   9  10
     11  12  13  14  15
     16  17  18  19  20
    

    Now imagine your startx/endx/starty/endy parameters select this slice in one process:

      8   9
     13  14
     18  19
    

    The entire array is 4x5 elements of 8 bytes each, so 160 bytes. The "window" is 3x2 elements, so 48 bytes.
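    These sizes, and the fact that writing through the slice reaches the original array, are easy to check in numpy. A small sketch (the array values match the figure above; the shapes are illustrative, not from the question's actual data):

```python
import numpy as np

# 2D stand-in for the 4-D renderbuffer; values match the figure above.
a = np.arange(1, 21, dtype=np.int64).reshape(4, 5)
window = a[1:4, 2:4]          # the 3x2 slice: [[8, 9], [13, 14], [18, 19]]

print(a.nbytes)               # 160 -- 4*5 elements of 8 bytes each
print(window.nbytes)          # 48  -- 3*2 elements of 8 bytes each

# The slice is a view, not a copy; writing through it hits the original:
window[:] = 0
print(a[1, 2])                # 0
```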

    Your expectation seems to be that accessing this 48 byte slice would require 48 bytes of memory in this one process. But actually it requires closer to the full 160 bytes. Why?

    The reason is that memory is mapped in pages, which are commonly 4096 bytes each. So when you access the first element (here, the number 8), you will map the entire page of 4096 bytes containing that element.
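    The page size is a property of the system, not of Python or numpy; 4096 bytes is typical on x86-64, but not universal (Apple Silicon uses 16 KiB, for example), so it is worth querying at runtime:

```python
import mmap

# Query the system page size rather than assuming 4096.
print(mmap.PAGESIZE)                               # e.g. 4096
assert mmap.PAGESIZE & (mmap.PAGESIZE - 1) == 0    # always a power of two
```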

    Memory mapping on many systems is guaranteed to start on a page boundary, so the first element in your array will be at the beginning of a page. So if your array is 4096 bytes or smaller, accessing any element of it will map the entire array into memory in each process.

    In your actual use case, each element you access in the slice will map the entire page containing it. Elements adjacent in memory (i.e. where the last index increments by one in C order, or the first index in Fortran order) will usually reside in the same page, but elements which are adjacent in other dimensions will likely land in separate pages.
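    You can see how quickly this adds up by counting the distinct pages a slice touches. A sketch, assuming a C-order array and 4096-byte pages (`pages_touched` is a helper written just for this illustration):

```python
import numpy as np

PAGE = 4096  # assumed page size, for illustration

def pages_touched(rows, cols, shape, itemsize, page=PAGE):
    """Count distinct pages covered by a[rows[0]:rows[1], cols[0]:cols[1]]
    in a C-order 2D array of the given shape."""
    pages = set()
    for r in range(*rows):
        # Within one row, the selected columns are one contiguous byte run.
        start = (r * shape[1] + cols[0]) * itemsize
        end = (r * shape[1] + cols[1]) * itemsize - 1
        pages.update(range(start // page, end // page + 1))
    return len(pages)

# A 512x512 float64 array: each row is 512*8 = 4096 bytes, exactly one page.
a = np.zeros((512, 512), dtype=np.float64)

# A narrow column slice touches only 1600 bytes of data (100 rows x 2
# elements x 8 bytes), but one full page per row gets mapped:
print(pages_touched((0, 100), (10, 12), a.shape, a.itemsize))  # 100 pages
```

    Here 100 pages means 409600 bytes mapped to write 1600 bytes of data, which matches the blow-up observed in the question.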

    But take heart: the memory mappings are shared between processes, so if your entire array is 200 MB, even though each process will end up mapping most or all of it, the total memory usage is still 200 MB combined across all processes. Many memory measurement tools will report that each process uses 200 MB, which is sort of true but misleading: they are all sharing a single 200 MB view of the same memory.
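    A minimal sketch of that sharing, using the standard library's `multiprocessing.shared_memory` (Python 3.8+) rather than posix_ipc; under the hood it wraps the same POSIX shm + mmap mechanism. Two ndarrays over the same buffer see each other's writes immediately, and a second process attaching by name would see the same bytes, with no copy of the data made:

```python
import numpy as np
from multiprocessing import shared_memory

# One shared segment sized for a 4x5 float64 array.
shm = shared_memory.SharedMemory(create=True, size=4 * 5 * 8)

a = np.ndarray((4, 5), dtype=np.float64, buffer=shm.buf)
b = np.ndarray((4, 5), dtype=np.float64, buffer=shm.buf)  # second view, same bytes

a[:] = 0
a[1:4, 2:4] = 1
print(b[1, 2])        # 1.0 -- the write is visible through the other view

del a, b              # release the buffer views before closing
shm.close()
shm.unlink()
```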