I cannot figure out whether what I want to do is possible in Dask...
Currently, I have a long list of heavy files. I am using the multiprocessing library to process every entry of the list: my function opens an entry, operates on it, saves the result to a binary file on disk, and returns None. Everything works fine, and I did this essentially to reduce RAM usage.
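For context, a stripped-down sketch of what I am doing now (func, the output names and file_list are simplified placeholders):

from multiprocessing import Pool

def process(path):
    new_value = func(path)                  # open one heavy file and operate on it
    new_value.tofile(path + '.out.binary')  # save the result straight to disk
    # nothing is returned, so nothing piles up in the parent process

if __name__ == '__main__':
    with Pool(processes=4) as pool:         # e.g. 4 worker processes
        pool.map(process, file_list)        # file_list is my list of heavy files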
I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:
for element in list:
    new_value = func(element)
    new_value.tofile('filename.binary')
where at most N elements are loaded at once (N being the number of workers), and each element is used and then forgotten at the end of each cycle.
Is it possible?
Thanks a lot for any suggestion!
That does sound like a feasible task:
from dask import delayed, compute

@delayed
def myfunc(element):
    new_value = func(element)
    # you will probably want a different destination for each element,
    # otherwise every task overwrites the same file
    new_value.tofile('filename.binary')

delayeds = [myfunc(e) for e in file_list]  # file_list is your list of heavy files
results = compute(delayeds)
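For example (just a sketch, assuming the element's index is enough to tell the outputs apart), you could pass the destination in explicitly:

@delayed
def myfunc(element, out_path):
    new_value = func(element)
    new_value.tofile(out_path)  # one output file per element

delayeds = [myfunc(e, f'result_{i}.binary') for i, e in enumerate(file_list)]
results = compute(delayeds)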
If you want fine control over tasks, you might want to explicitly specify the number of workers by starting a LocalCluster:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
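Once the Client exists, compute will use that cluster automatically. By default each worker may run more than one thread, so if you want to be sure that only one heavy element is held per worker at a time, you can pin the workers to a single thread (a sketch):

cluster = LocalCluster(n_workers=3, threads_per_worker=1)  # at most 3 tasks run at once
client = Client(cluster)

results = compute(delayeds)  # now executes on the cluster above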
There is a lot more that can be done to customize the settings/workflow, but perhaps the above will work for your use case.