I want to share a large read-only object (a list of lists of str) among the processes in a process Pool. Since it's read-only, I don't want a lock on it. I've tried `multiprocessing.Value`, but it seems to support only ctypes types, and I want to share a list of lists of str.
I've also tried `multiprocessing.Manager().list`, but according to the documentation the manager is a sync manager, so I suppose it will put a lock on the list, which is not what I want.
So what's the best practice to do this?
It depends on what tradeoffs you are willing to make.
I can see multiple ways of doing this, each with its own advantages and disadvantages:
- Use shared memory: `multiprocessing.Array` or `mmap`. These are specifically designed to be shared between processes created with `multiprocessing` or `os.fork()`. They have low overhead and translate almost directly into operating-system primitives for shared memory. The downside is that you just get one huge fixed-length array of bytes. If you want additional structure on top of that (for example, a list of lists of strings), you need to serialize and deserialize it manually. You may find the `struct` and `array` modules helpful for that purpose. If you're feeling adventurous, you can also access the elements in place through a `memoryview` object.
- Rely on copy-on-write after `os.fork()`. In theory, child processes share the parent's memory pages until they write to them. In practice, that does nothing for us because Python reference-counts the strings, which writes to the memory and forces the OS to copy nearby data. `array` doesn't refcount its contents and therefore might be less susceptible to this issue, if your individual arrays are large enough.
- Create a temporary file with `tempfile` and store the information there using `json`, `pickle`, or `sqlite3`. The temporary file is also visible to child processes, and the `tempfile` module will take care of cleaning it up for you when finished. However, reading data from permanent storage is typically slower than in-memory solutions.