Tags: python, multiprocessing, python-multiprocessing, ray

Python Ray memory issue when running multiple tasks with large arguments


When calling function.remote() many times with a large argument, there seems to be a memory issue.

import ray

ray.init()

@ray.remote
def function(args):
    # do something with args
    pass

args = {}              # big dictionary
args = ray.put(args)   # store the dictionary in the object store once

futures = [function.remote(args) for _ in range(1000)]
ray.get(futures)

When I ran code like the above, RAM usage kept increasing and a memory-leak-like problem occurred. As far as I know, ray.put writes args to shared memory, so every worker process running the function should access args in shared memory instead of copying it into each process. If that were the case, memory usage would not increase.
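For what it's worth, one way to make the per-task growth visible (a rough sketch; psutil is not part of the original code and is only used here for illustration) is to report each worker's resident memory from inside the task:

import os

import psutil
import ray

ray.init()

@ray.remote
def function(args):
    # args is deserialized into this worker's heap before the body runs;
    # reporting the worker's resident set size makes the per-task cost visible
    return psutil.Process(os.getpid()).memory_info().rss / 1e6  # MB

args = ray.put({i: i for i in range(10**6)})  # big dictionary
print(max(ray.get([function.remote(args) for _ in range(100)])))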

Is there anything I am misunderstanding?


Solution

TL;DR

Can you use a NumPy array or another Arrow-compatible datatype instead of your args dict? A dict incurs a deserialization cost plus additional memory to hold the deserialized value for every concurrent use of the object; see the sketch below.
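A minimal sketch of that recommendation, assuming the large argument can be represented as a NumPy array (the name args_ref is illustrative, not from the original answer):

import numpy as np
import ray

ray.init()

@ray.remote
def function(args):
    # the array arrives as a read-only view backed by the shared object store,
    # so concurrent tasks on the same node do not each materialize a copy
    return float(args.sum())

args = np.zeros(10**7, dtype=np.float64)  # large, Arrow-compatible payload
args_ref = ray.put(args)

results = ray.get([function.remote(args_ref) for _ in range(1000)])

With an array, workers on the same node map the buffer out of shared memory rather than copying it, which is the behaviour the question expected for the dict.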

Details

Ray's support for zero-copy semantics seems to have been built with NumPy arrays and other data types supported by Apache Arrow as first-class citizens. It is technically feasible to add zero-copy support for dict and other types, but Ray would need to subclass those types so that mutations trigger a copy-on-write mechanism, or otherwise provide a zero-copy deserialization path from the object store to the Python heap. Maybe a good idea to make Ray better :)
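A small illustration of why the shared, zero-copy view must not be mutated in place (reusing the hypothetical args_ref from the sketch above):

arr = ray.get(args_ref)     # zero-copy read: backed by shared memory
print(arr.flags.writeable)  # False - the shared buffer is read-only
writable = arr.copy()       # mutation requires an explicit private copy
writable[0] = 1.0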

Furthermore, Ray doesn't currently have good memoization support, so even if you put two separate but identical dicts into the object store, you'll get two distinct object references.
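For example (a quick check, not from the original answer):

d = {"weights": list(range(10**6))}

ref_a = ray.put(d)
ref_b = ray.put(dict(d))  # identical contents, stored separately

print(ref_a == ref_b)     # False: no memoization, two objects in the store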

If having these features would unblock you, I'd recommend creating a GitHub issue in the Ray repository.

Source: I asked Stephanie Wang, a co-author of the original Ray paper, this question.