bash, lsf

How can I use the Platform LSF blaunch command to start processes simultaneously?


I'm having a hard time figuring out why I can't launch commands in parallel using the LSF blaunch command:

for num in `seq 3`; do
    blaunch -u JobHost ./cmd_${num}.sh &
done

Error message:

Oct 29 13:08:55 2011 18887 3 7.04 lsb_launch(): Failed while executing tasks.
Oct 29 13:08:55 2011 18885 3 7.04 lsb_launch(): Failed while executing tasks.
Oct 29 13:08:55 2011 18884 3 7.04 lsb_launch(): Failed while executing tasks.

Removing the ampersand (&) allows the commands to execute sequentially, but I am after parallel execution.


Solution

  • When run inside a bsub job, a single invocation of blaunch -u <hostfile> <cmd> runs <cmd> in parallel on every host listed in <hostfile>, as long as those hosts are within the job's allocation.

    What you're trying to do is use 3 separate invocations of blaunch to run 3 separate commands. I can't find this in the documentation, but testing on a recent version of LSF shows that each task launched within such a job gets a unique task ID, stored in an environment variable called LSF_PM_TASKID. You can verify this on your version of LSF by running something like:

    bsub -I -n <num_tasks> blaunch env | grep TASKID
    

    Now, what does this have to do with your question? You want to run ./cmd_$i.sh for i=1,2,3 in parallel through blaunch. To do this you can write a single script which I'll call cmd.sh as follows:

    #!/bin/sh
    ./cmd_${LSF_PM_TASKID}.sh
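
    If cmd.sh is ever started outside of blaunch, LSF_PM_TASKID is empty and the wrapper above tries to run ./cmd_.sh. A slightly more defensive variant might look like the sketch below. The function name run_task and the error messages are made up for illustration, and it assumes LSF_PM_TASKID behaves as observed in the testing described above, which is not a documented guarantee.

    ```shell
    #!/bin/sh
    # Sketch of a more defensive cmd.sh. run_task is a hypothetical helper name.
    # Assumption: LSF sets LSF_PM_TASKID for each task (observed, not documented).
    run_task() {
        # Refuse to guess a script name when no task ID is available.
        if [ -z "${LSF_PM_TASKID:-}" ]; then
            echo "LSF_PM_TASKID is not set; was this started through blaunch?" >&2
            return 1
        fi
        script="./cmd_${LSF_PM_TASKID}.sh"
        # Fail early with a clear message if there is no script for this task.
        if [ ! -x "$script" ]; then
            echo "no executable script found at $script" >&2
            return 1
        fi
        "$script"
    }
    ```

    With a final line that simply calls run_task, the exit status of the per-task script becomes the exit status of cmd.sh.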
    

    Now you can replace your for loop with a single invocation of blaunch like so:

    blaunch -u JobHost cmd.sh
    

    This will run one instance of cmd.sh in parallel on each host listed in the file 'JobHost'. Each of these instances will run the shell script cmd_X.sh, where X is the value of $LSF_PM_TASKID for that particular task.

    If there are exactly 3 hostnames in 'JobHost', you will get 3 instances of cmd.sh, which will in turn lead to one instance each of cmd_1.sh, cmd_2.sh, and cmd_3.sh.
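
    The dispatch logic can be tried out locally before submitting a real job. The sketch below does not involve LSF at all: it sets LSF_PM_TASKID by hand for three background processes, which only emulates what blaunch is assumed to do based on the testing above. The stand-in cmd_N.sh scripts and the out_N.txt file names are placeholders.

    ```shell
    #!/bin/sh
    # Local dry run of the dispatch logic. No LSF involved: LSF_PM_TASKID is set
    # by hand here, whereas in a real job it would be provided by LSF.
    workdir=$(mktemp -d) || exit 1
    cd "$workdir" || exit 1

    # Three stand-in task scripts (placeholders for the real cmd_1.sh .. cmd_3.sh).
    for i in 1 2 3; do
        printf '#!/bin/sh\necho "cmd_%s.sh ran"\n' "$i" > "cmd_$i.sh"
        chmod +x "cmd_$i.sh"
    done

    # The two-line wrapper from the answer, written out with printf.
    printf '%s\n' '#!/bin/sh' './cmd_${LSF_PM_TASKID}.sh' > cmd.sh
    chmod +x cmd.sh

    # Start one wrapper per emulated task in the background, then wait for all.
    for i in 1 2 3; do
        LSF_PM_TASKID=$i ./cmd.sh > "out_$i.txt" &
    done
    wait

    # Each output file should name the script that matched its task ID.
    cat out_1.txt out_2.txt out_3.txt
    ```

    Writing each task's output to its own file keeps the result readable even though the three wrappers run concurrently.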