This is kind of a general best-practice question. I have a Python script which iterates over some arguments and calls another script with those arguments (it's basically a grid search for some simple Deep Learning models). This works fine on my local machine, but now I need the resources of my uni's computer cluster, which uses SLURM. I have some logic in the Python script that I think would be difficult to implement, and maybe out of place, in a shell script. I also can't just throw all the jobs at the cluster at once, because I want to skip certain parameter combinations depending on the outcome (loss) of others. Now I'd like to submit the SLURM jobs directly from my Python script and still handle the more complex logic there. My question is what the best way to implement something like this is, and whether running a Python script on the login node would be bad-mannered. Should I use the subprocess module? Snakemake? Joblib? Or are there other, more elegant ways?
Snakemake and Joblib are valid options; both will handle the communication with the Slurm cluster for you. Another possibility is Fireworks. This one is a bit more tedious to get running; it needs a MongoDB database and has a vocabulary that takes some getting used to, but in the end it can do very complex things. You can, for instance, create a workflow that submits jobs to multiple clusters, runs other jobs that depend on the output of previous ones, and automatically resubmits the ones that failed, with different parameters if needed.
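If you want to stay with the plain subprocess approach from your question, a minimal sketch could look like the one below. It relies on `sbatch --wait`, so the Python driver just blocks on the login node while the actual work runs on the compute nodes. The batch script name (`train.sbatch`), its arguments, the result-file convention, and the pruning rule are all hypothetical placeholders you would replace with your own.

```python
# Sketch: drive a grid search from Python, one Slurm job per grid point.
# The driver does almost no computation itself -- it only submits jobs
# and reads back a loss value that the batch script is assumed to write.
import subprocess
from pathlib import Path

def run_job(lr: float, batch_size: int) -> float:
    """Submit one grid point and block until it finishes, then return its loss."""
    # `sbatch --wait` does not return until the submitted job has terminated.
    subprocess.run(
        ["sbatch", "--wait", "train.sbatch", str(lr), str(batch_size)],
        check=True,
    )
    # Assumes the (hypothetical) batch script writes its final loss here.
    return float(Path(f"results/loss_lr{lr}_bs{batch_size}.txt").read_text())

best_loss = float("inf")
for lr in (1e-2, 1e-3, 1e-4):
    loss = run_job(lr, batch_size=32)
    if loss > 2 * best_loss:
        # Hypothetical pruning rule: skip further batch sizes for a learning
        # rate that is already much worse than the best result so far.
        continue
    best_loss = min(best_loss, loss)
    for batch_size in (64, 128):
        best_loss = min(best_loss, run_job(lr, batch_size))
```

Since the driver spends essentially all its time waiting on `sbatch`, running it on the login node is usually unproblematic; this is also roughly the pattern the workflow tools above automate for you, along with retries, parallel submission, and dependency tracking.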