Tags: python-3.x, multiprocessing, networkx, shared-memory, readonly

Sharing NetworkX graph between processes with no additional memory cost (read-only)


I am using Python's multiprocessing module. I have a networkx graph which I wish to share between many subprocesses. These subprocesses do not modify the graph in any way; they only read its attributes (nodes, edges, etc.). Right now every subprocess has its own copy of the graph, but I am looking for a way to share the graph between all of them, which would reduce the memory footprint of the entire program. Since the computations are very CPU-intensive, I want this to be done in a way that does not cause big performance issues (avoiding locks if possible, etc.).

Note: I want this to work on various operating systems, including Windows, which means copy-on-write (COW) does not help (and if I understand correctly, it probably wouldn't have helped regardless, due to reference counting).

I found https://docs.python.org/3/library/multiprocessing.html#proxy-objects and https://docs.python.org/3/library/multiprocessing.shared_memory.html, but I'm not sure which (or whether either) is suitable. What is the right way to go about this? I'm using Python 3.8, but can use later versions if helpful.


Solution

  • There are a few options for sharing data between processes in Python, but you may not be able to do exactly what you want.

    In C++ you could use plain shared memory for ints, floats, structs, etc. Python's shared memory manager does allow this type of sharing for simple objects, but it doesn't work for classes or anything more complex than a flat list of base types (see the sketch after the list below). For sharing complex Python objects, you really only have a few choices:

    1. Create a copy of the object in your forked process (which it sounds like you don't want to do).

    2. Put the object in a centralized process (i.e., Python's Manager / proxy objects) and interact with it via pipes and pickled data.

    3. Convert your NetworkX graph to a list of simple ints and put it in shared memory.
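
    As a quick illustration of that base-types limitation, here is a minimal sketch (assuming Python 3.8+) using `multiprocessing.shared_memory.ShareableList`, which can hold a flat sequence of ints, floats, bools, strings, bytes and None, but not a graph object or any other class instance:

    ```python
    from multiprocessing import shared_memory

    # A ShareableList lives in a named shared memory block that other
    # processes can attach to; it only accepts flat sequences of base types.
    sl = shared_memory.ShareableList([1, 2, 3, 0.5, "label"])
    print(sl[3])        # 0.5
    print(sl.shm.name)  # pass this name to workers so they can attach

    # In a worker process you would attach instead of create:
    # existing = shared_memory.ShareableList(name=sl.shm.name)

    # Cleanup in the owning process once all workers are done:
    sl.shm.close()
    sl.shm.unlink()
    ```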

    What works for you is going to depend on some specifics. Option #2 has a bit of overhead because every time you need to access the object, the request has to be pickled and piped to the centralized process, and the result pickled/piped back. This works well if you only need a small portion of the centralized data at a time and your processing steps are relatively long compared to the pickle/pipe time; a sketch follows below.
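
    Here is a minimal sketch of the Manager / proxy approach for option #2, assuming you wrap the graph in a small query class; the names `GraphService`, `GraphManager` and `worker` are just illustrative, not an existing API:

    ```python
    from multiprocessing import Process
    from multiprocessing.managers import BaseManager
    import networkx as nx

    class GraphService:
        """Holds the graph in a single process; others query it via a proxy."""
        def __init__(self, graph):
            self._graph = graph

        def neighbors(self, node):
            # Return a plain list so the result pickles cleanly back to the caller.
            return list(self._graph.neighbors(node))

    class GraphManager(BaseManager):
        pass

    GraphManager.register("GraphService", GraphService)

    def worker(service, node):
        # Each call is pickled, piped to the manager process, executed there,
        # and the result is pickled/piped back.
        print(node, service.neighbors(node))

    if __name__ == "__main__":
        g = nx.gnp_random_graph(1000, 0.01, seed=0)
        with GraphManager() as manager:
            service = manager.GraphService(g)  # graph is pickled once into the manager process
            procs = [Process(target=worker, args=(service, n)) for n in range(4)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
    ```

    Only one copy of the graph lives in the manager process, but every query pays the pickle/pipe round trip, which is why this fits workloads that touch a small slice of the graph per step.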

    Option #3 could be a lot of work. You would fundamentally be changing the data format from NetworkX objects to a list of ints, so it is going to change the way you do your processing significantly; a sketch follows below as well.
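
    For option #3, a common pattern is to flatten the edge list into an integer array backed by shared memory that every worker attaches to without copying. The sketch below assumes integer node labels and uses numpy to view the shared block; the function names are just illustrative:

    ```python
    import numpy as np
    import networkx as nx
    from multiprocessing import Process, shared_memory

    def build_shared_edge_array(graph):
        """Flatten the graph into a (num_edges, 2) int array in shared memory."""
        edges = np.array(list(graph.edges()), dtype=np.int64)
        shm = shared_memory.SharedMemory(create=True, size=edges.nbytes)
        shared = np.ndarray(edges.shape, dtype=edges.dtype, buffer=shm.buf)
        shared[:] = edges  # one-time copy into the shared block
        return shm, edges.shape, edges.dtype

    def worker(shm_name, shape, dtype):
        # Attach to the existing block: no copy of the edge data is made here.
        shm = shared_memory.SharedMemory(name=shm_name)
        edges = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        # Read-only analysis on the raw edge list, e.g. the degree of node 0:
        print("degree of node 0:", int(np.count_nonzero(edges == 0)))
        shm.close()

    if __name__ == "__main__":
        g = nx.convert_node_labels_to_integers(nx.karate_club_graph())
        shm, shape, dtype = build_shared_edge_array(g)
        procs = [Process(target=worker, args=(shm.name, shape, dtype)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        shm.close()
        shm.unlink()
    ```

    The trade-off is exactly as described above: the workers see raw ints rather than NetworkX objects, so any algorithm that relies on the NetworkX API has to be rewritten against the flat representation.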

    A while back I put together PythonDataServe, which allows you to serve your data to multiple processes from another process. It is a very similar solution to #2 above. This type of approach works if you only need a small portion of the data at a time, but if you need it all, it's much easier to just create a local copy.