Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason listed as "Priority"). Meanwhile, the master would run and eventually hit the wall-time limit, leaving the workers to die when they finally started.
Thinking that the issue might be that I was requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on GitHub. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or launch new masters every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler, which slots lower-priority jobs into idle gaps as long as they will finish before higher-priority jobs are due to start. Because my Dask workers were requesting the maximum possible --time (24 hours), they almost never fit into such a gap, so backfill could not start them early. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
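For anyone who wants to script this, here is a minimal sketch of picking a shorter --time from a runtime estimate instead of always asking for the 24h maximum. The helper names (`slurm_walltime`, `worker_sbatch_args`), the 25% safety margin, and the `worker.sh` script name are my own assumptions for illustration, not something from our cluster's setup; only the `sbatch` options `--nodes` and `--time` are real SLURM flags.

```python
import math


def slurm_walltime(estimated_hours, safety_factor=1.25, limit_hours=24.0):
    """Format a padded runtime estimate as HH:MM:SS for sbatch --time.

    safety_factor is an assumed margin; limit_hours is the cluster's
    hard wall-time limit (24h in the question).
    """
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    # Pad the estimate, but never request more than the cluster allows.
    padded_hours = min(estimated_hours * safety_factor, limit_hours)
    total_seconds = math.ceil(padded_hours * 3600)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def worker_sbatch_args(estimated_hours, nodes=1, script="worker.sh"):
    """Build an sbatch command line for a worker job.

    'worker.sh' is a hypothetical batch script that starts the Dask worker.
    """
    return [
        "sbatch",
        f"--nodes={nodes}",
        f"--time={slurm_walltime(estimated_hours)}",
        script,
    ]


if __name__ == "__main__":
    # A 4h estimate becomes a 5h request rather than the full 24h.
    print(" ".join(worker_sbatch_args(4, nodes=2)))
```

The shorter the requested time, the more gaps the backfill scheduler can fit the workers into, so it is worth keeping the margin modest. If you launch workers through a Python job-queue wrapper rather than raw `sbatch`, the same string can be passed to whatever wall-time setting that wrapper exposes.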