Python Ray seems to copy objects for every remote function call

This should be a simple one.

I have a huge dataset and I need to run a simulation multiple times, going through this dataset over and over again, read-only. I want to run these simulations in parallel, but I can't afford to load the dataset in every process (it's over 5GB), so I tried Ray's "shared memory" functionality. (I could try multiprocessing as well, but Ray seemed easier and faster.)

The code below is basically a copy from most examples I could find about it.

import time

import ray

def run_simulation_parallel():
    proc_list = []
    list_id = ray.put(huge_list)  # 5GB+ list, every position has a dictionary
    for i in range(10):
        proc_list.append(simulation.remote(i, list_id))  # create multiple processes
    results = ray.get(proc_list)  # block until every task has finished

@ray.remote
def simulation(i, list_id):
    # Ray resolves list_id to the stored object before the task body runs
    time.sleep(60)  # do nothing, just keep the process alive
    return

When I run the code above, I can see through task manager that every new process is building up to 5GB+, meaning it's loading the whole dataset multiple times.

I've seen people saying this is the intended use case for Ray (e.g. "Shared-memory objects in multiprocessing", Robert Nishihara's answer). So this should be possible, but every example is the same as my code. What am I missing here?

Using Python 3.9, PyCharm, Windows 11.

Edit: I tried replacing the dataset (a list of dictionaries) with a simple array full of ones, and now the worker processes no longer consume as much RAM as the main one. Can Ray really only store arrays in shared memory, and not other kinds of objects?

Solution

Ray supports zero-copy deserialization when the object created by ray.put is a numeric numpy array (see https://docs.ray.io/en/master/ray-core/objects/serialization.html#numpy-arrays), or another zero-copyable object that supports pickle protocol 5 out-of-band serialization (https://peps.python.org/pep-0574/). A list of dictionaries is neither, so each worker deserializes its own full copy.
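To see the difference between the two kinds of objects without Ray involved at all, here is a minimal sketch using only pickle protocol 5 and numpy (Python 3.8+). The `records` and `prices` data are hypothetical stand-ins for the real dataset. It only illustrates the out-of-band mechanism from PEP 574; it is not Ray's actual serialization code.

```python
import pickle

import numpy as np

# Hypothetical stand-ins for the real dataset.
records = [{"price": float(i), "volume": i} for i in range(1000)]
prices = np.arange(1000, dtype=np.float64)

# With protocol 5, objects that support out-of-band pickling hand their
# raw buffers to buffer_callback instead of copying them into the stream.
record_buffers = []
record_payload = pickle.dumps(
    records, protocol=5, buffer_callback=record_buffers.append
)

array_buffers = []
array_payload = pickle.dumps(
    prices, protocol=5, buffer_callback=array_buffers.append
)

# The list of dicts exposes no buffers: everything is pickled in-band,
# so a reader has to rebuild every dict, float and int from scratch.
print("buffers from list of dicts:", len(record_buffers))

# The numpy array exposes its data as a separate buffer, which is the
# part a framework can place in shared memory and map without copying.
print("buffers from numpy array:", len(array_buffers))

# Round trip: the array is rebuilt from the small payload plus the buffer.
restored = pickle.loads(array_payload, buffers=array_buffers)
```

The in-band payload for the array stays small because the bulk of the data travels in the separate buffer, while the payload for the list of dicts contains all of the data.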

Also note that even when zero-copy is in effect, the per-process memory usage reported by top or htop includes shared memory. You can verify this by checking the SHR column of htop next to the per-process usage. (So if RES is 5 GB and SHR is 4 GB, the real memory usage of that process is just 1 GB.)

You can also consider using Ray Datasets to load the data in a zero-copyable format (https://docs.ray.io/en/master/data/dataset.html).
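If a full Ray Dataset is more than you need, a lighter option is to convert the list of dictionaries into one numpy array per key before calling ray.put. The sketch below assumes every dictionary has the same keys and numeric values. The helper names `records_to_columns` and `run_simulations` are made up for illustration. As far as I understand Ray's serialization, numpy arrays nested inside a dict are still read from the object store without copying, but treat that as an assumption and check it against your own memory numbers.

```python
import numpy as np


def records_to_columns(records):
    """Turn a list of same-keyed numeric dicts into one numpy array per key."""
    keys = list(records[0].keys())
    return {key: np.array([rec[key] for rec in records]) for key in keys}


def run_simulations(columns, num_tasks=4):
    # Imported lazily so the conversion helper above can be used and
    # tested on a machine where Ray is not installed.
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def simulation(i, cols):
        # cols arrives as a dict of read-only numpy arrays; reading them
        # should not create a private copy of the underlying data.
        return i, sum(float(arr.sum()) for arr in cols.values())

    cols_ref = ray.put(columns)  # stored once in the object store
    return ray.get([simulation.remote(i, cols_ref) for i in range(num_tasks)])
```

With Ray installed, `run_simulations(records_to_columns(huge_list))` would replace the original `ray.put(huge_list)` call. The arrays handed to each task are read-only, which matches the read-only access pattern described in the question.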