Tags: python-3.x, dask, binaryfiles, dask-delayed

Is it possible to limit memory usage by writing to disk?


I cannot work out whether what I want to do is possible in Dask...

Currently, I have a long list of heavy files. I am using the multiprocessing library to process every entry of the list. My function opens an entry, operates on it, saves the result to disk as a binary file, and returns None. Everything works fine. I did this essentially to reduce RAM usage.
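
For reference, here is a minimal sketch of that multiprocessing setup (process_entry, the doubling operation, and the file names are placeholders for my real workload):

import numpy as np
from multiprocessing import Pool

def process_entry(path):
    data = np.fromfile(path, dtype=np.float64)   # open one heavy file
    result = data * 2                            # placeholder for the real operation
    result.tofile(path + '.out.binary')          # save the result as a binary file
    return None                                  # nothing is kept in RAM

if __name__ == '__main__':
    heavy_files = ['file_0.binary', 'file_1.binary']   # hypothetical input list
    with Pool(processes=4) as pool:
        pool.map(process_entry, heavy_files)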

I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:

for element in my_list:
    new_value = func(element)
    new_value.tofile('filename.binary')

where only N elements are loaded at once (N being the number of workers), and each element is used and then freed at the end of each cycle.

Is it possible?

Thanks a lot for any suggestion!


Solution

  • That does sound like a feasible task:

    from dask import delayed, compute

    @delayed
    def myfunc(element, index):
        new_value = func(element)   # func is your existing processing function
        # give each element its own destination so results don't overwrite each other
        new_value.tofile(f'filename_{index}.binary')

    delayeds = [myfunc(e, i) for i, e in enumerate(my_list)]
    results = compute(*delayeds)
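
    Note that compute blocks until all the tasks have finished; since myfunc returns nothing, results is just a tuple of Nones here, and the real output is the set of files written to disk.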
    

    If you want finer control over the tasks, you can explicitly set the number of workers by starting a LocalCluster:

    from dask.distributed import Client, LocalCluster

    # three worker processes; you can also cap each worker's RAM
    # with e.g. memory_limit='2GB' if memory is the main concern
    cluster = LocalCluster(n_workers=3)
    client = Client(cluster)
    

    There is a lot more that can be done to customize the settings/workflow, but perhaps the above will work for your use case.
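
    If you don't need a distributed cluster at all, the local multiprocessing scheduler can also bound concurrency; as a minimal sketch (assuming the delayeds list from above), num_workers caps how many elements are processed, and therefore held in memory, at once:

    results = compute(*delayeds, scheduler='processes', num_workers=3)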