I have a setup consisting of 3 workers and a management node, which I use for submitting tasks. I would like to run a setup script concurrently on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" mpirun setup.sh
As far as I understand, I could use the 'ptile' resource constraint to force execution on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" -R 'span[ptile=1]' mpirun setup.sh
However, occasionally my script gets executed several times on the same worker.
Is this expected behavior, or is there a bug in my setup? Is there a better way to enforce execution on every worker?
Your understanding of span[ptile=1] is correct. LSF will only use 1 core per host for your job. If there aren't enough hosts to satisfy -n, the job will pend until something frees up.
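To confirm from inside the job that the allocation really is one slot per host, a small check along these lines may help. It is only a sketch: check_one_slot_per_host is a hypothetical helper name, and it assumes the allocation string has the "host slots host slots ..." form that LSF exports in LSB_MCPU_HOSTS.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): succeeds only if every host in the
# allocation string has exactly one slot.
# $1: a string such as "h0 1 h1 1 h2 1", assumed to match the
#     "host slots host slots ..." layout of LSB_MCPU_HOSTS.
check_one_slot_per_host() {
    # Split the string into positional parameters: host, slots, host, slots, ...
    set -- $1
    # An empty allocation is treated as a failure.
    [ $# -gt 0 ] || return 1
    while [ $# -ge 2 ]; do
        # $1 is the host name, $2 is its slot count.
        [ "$2" -eq 1 ] || return 1
        shift 2
    done
    # A leftover token means the string was malformed.
    [ $# -eq 0 ]
}
```

For example, calling check_one_slot_per_host "$LSB_MCPU_HOSTS" || echo "unexpected allocation: $LSB_MCPU_HOSTS" at the top of the job script would flag a run that was not spread one per host.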
However, occasionally my script gets executed several times on the same worker.
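I suspect that it's something in your script or in how you read its output. For example, LSF appends to the stdout file by default, so output left over from earlier runs can make it look like the script ran several times on one worker. Use -oo to overwrite.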
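To tell real duplicate executions apart from stale output, one option is to have each invocation of the script leave its own marker file and count the markers per host afterwards. This is a rough sketch, not anything LSF provides: record_run and runs_per_host are hypothetical names, and the marker directory is assumed to be on a filesystem that every worker can see.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers (not part of LSF).

# record_run: create one empty marker file per invocation, named <host>.<pid>.
# $1: marker directory, assumed to be shared by all workers.
record_run() {
    mkdir -p "$1" && : > "$1/$(uname -n).$$"
}

# runs_per_host: print "<count> <host>" for every host that left markers,
# sorted by host name.
# $1: the same marker directory.
runs_per_host() {
    ls "$1" | sed 's/\.[0-9]*$//' | sort | uniq -c | awk '{ print $1, $2 }'
}
```

Call record_run with a fresh, empty directory at the top of setup.sh, and run runs_per_host on that directory from the management node once the job has finished. A count above 1 for any host means the script really did start more than once there; a count of 1 everywhere points back at appended output.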