python, dask, dask-delayed

Task management and monitoring within a Python script via Dask


I have a project folder with many sub-folders (say 100). The Python script navigates to each of these sub-folders, calls an executable, writes the results to an out file, and moves on to the next sub-folder.

Here is my Python script:

import os
import subprocess

from dask_jobqueue import PBSCluster
cluster = PBSCluster()
cluster.scale(jobs=3)  

from dask.distributed import Client
client = Client(cluster)
...

r_path='/path/to/project/folder'


def func():
    # Run the executable in the current directory and write its stdout to 'out'.
    with open('out', 'w') as f:
        subprocess.call(["/path/to/executable/file"], stdout=f)

for root, dirs, files in os.walk("."):
    for name in dirs:
        os.chdir(r_path+'/'+str(name))
        func()

The project has two requirements:

  1. The out file(s) are needed for further computations, so the script needs to know when execution has completed for a given sub-folder.
  2. The executable must run in at most 10 sub-folders at any given time; whenever one of those 10 runs completes, a new one should start in another sub-folder.

Could someone please let me know whether it is possible to use Dask to do this?


Solution

  • Yes, it is possible to use Dask to do this. You probably want to read the documentation on Dask delayed or Dask futures. Futures are likely the closer fit here, since they let the script react as each task finishes.

    https://docs.dask.org/en/latest/delayed.html

    https://docs.dask.org/en/latest/futures.html
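    As a rough illustration of the futures route, here is a minimal sketch rather than a definitive implementation. It assumes a `dask.distributed` `Client` already exists (for example the one built from `PBSCluster` in the question). The names `run_in_folder`, `run_all`, `command` and `max_running` are made up for this example. The idea is to submit the first 10 runs, then use `as_completed` to learn when each one finishes and top the queue back up with the next sub-folder.

```python
import os
import subprocess
from itertools import islice

from dask.distributed import as_completed


def run_in_folder(folder, command):
    """Run `command` (a list of arguments) inside `folder`; stdout goes to <folder>/out."""
    out_path = os.path.join(folder, "out")
    with open(out_path, "w") as f:
        # cwd= avoids os.chdir, which would change the directory for every
        # thread in the worker process.
        returncode = subprocess.call(command, stdout=f, cwd=folder)
    return folder, returncode


def run_all(client, folders, command, max_running=10):
    """Yield (folder, returncode) as each run finishes.

    At most `max_running` tasks are submitted to the cluster at any time.
    """
    pending = iter(folders)

    # Start the first batch of runs.
    first_batch = [
        client.submit(run_in_folder, folder, command, pure=False)
        for folder in islice(pending, max_running)
    ]
    queue = as_completed(first_batch)

    for future in queue:
        # This run is complete, so its out file is ready for further work.
        yield future.result()

        # Replace the finished run with the next sub-folder, if any remain.
        next_folder = next(pending, None)
        if next_folder is not None:
            queue.add(client.submit(run_in_folder, next_folder, command, pure=False))
```

    With the setup from the question, this could be driven by something like `for folder, returncode in run_all(client, folders, ["/path/to/executable/file"]): ...`, where `folders` is a list of absolute sub-folder paths. Note that `max_running` only caps how many tasks are handed to Dask at once; how many truly run in parallel also depends on the number of worker threads the cluster provides.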