I want to share a large read-only object (a list of lists of str) among the processes in a process Pool. Since it's read-only, I don't want a lock on it. I've tried `multiprocessing.Value`, but it seems to support only ctypes types, and I want to share a list of lists of str.
I've also tried `multiprocessing.Manager().list`, but according to the documentation the manager is a sync manager, so I suppose it will put a lock on the list, which is not what I want.
So what's the best practice to do this?
It depends on what tradeoffs you are willing to make.
I can see multiple ways of doing this, each with its own advantages and disadvantages:
- Use shared memory: `multiprocessing.Array` or `mmap`. These are specifically designed to be shared between processes created with `multiprocessing` or `os.fork()`. They have low overhead and translate almost directly into operating-system primitives for shared memory. The downside is that you just get one huge fixed-length array of bytes. If you want additional structure on top of that (for example, a list of lists of strings), you need to serialize and deserialize it manually. You may find the `struct` and `array` modules helpful for that purpose. If you're feeling adventurous, you can also access the elements in place through a `memoryview` object.
- Rely on copy-on-write after `os.fork()`. In theory, child processes share the parent's memory pages until they write to them. In practice, that does nothing for us because Python reference-counts the strings, which writes to the memory and forces the OS to copy nearby data. `array` doesn't refcount its contents and therefore might be less susceptible to this issue, if your individual arrays are large enough.
- Create a temporary file with `tempfile` and store the information there using `json`, `pickle`, or `sqlite3`. The temporary file is also visible to child processes, and the `tempfile` module will take care of cleaning it up for you when finished. However, reading data from permanent storage is typically slower than in-memory solutions.