I'm having a hard time figuring out why I can't launch commands in parallel using the LSF blaunch command:
for num in `seq 3`; do
blaunch -u JobHost ./cmd_${num}.sh &
done
Error message:
Oct 29 13:08:55 2011 18887 3 7.04 lsb_launch(): Failed while executing tasks.
Oct 29 13:08:55 2011 18885 3 7.04 lsb_launch(): Failed while executing tasks.
Oct 29 13:08:55 2011 18884 3 7.04 lsb_launch(): Failed while executing tasks.
Removing the ampersand (&) allows the commands to execute sequentially, but I am after parallel execution.
When executed within the context of bsub, a single invocation of blaunch -u <hostfile> <cmd> will take <cmd> and run it on all the hosts specified in <hostfile> in parallel, as long as those hosts are within the job's allocation.
What you're trying to do is use 3 separate invocations of blaunch to run 3 separate commands. I can't find it in the documentation, but some testing on a recent version of LSF shows that each individually executed task in such a job has a unique task ID, stored in an environment variable called LSF_PM_TASKID. You can verify this in your version of LSF by running something like:
bsub -I -n <num_tasks> blaunch env | grep TASKID
Now, what does this have to do with your question? You want to run ./cmd_$i.sh for i=1,2,3 in parallel through blaunch. To do this you can write a single script, which I'll call cmd.sh, as follows:
#!/bin/sh
./cmd_${LSF_PM_TASKID}.sh
Now you can replace your for loop with a single invocation of blaunch like so:
blaunch -u JobHost cmd.sh
This will run one instance of cmd.sh on each host listed in the file 'JobHost' in parallel. Each of these instances will run the shell script cmd_X.sh, where X is the value of $LSF_PM_TASKID for that particular task.
If there are exactly 3 hostnames in 'JobHost' then you will get 3 instances of cmd.sh, which will in turn lead to one instance each of cmd_1.sh, cmd_2.sh, and cmd_3.sh.
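Since LSF_PM_TASKID was found by testing rather than in the documentation, it may be worth having cmd.sh fail loudly when the variable is missing or when no matching script exists. The following is only a sketch of such a guard: the function name run_task, the optional directory argument, and the exit codes 2 and 3 are my own choices for illustration, not anything defined by LSF.

```shell
#!/bin/sh
# Sketch of a more defensive dispatcher for cmd.sh.
# run_task is a hypothetical helper name; LSF does not provide it.
#
# $1: task ID (expected to come from LSF_PM_TASKID, which is assumed
#     to be set by blaunch, as observed in testing)
# $2: directory holding cmd_<id>.sh (optional, defaults to ".")
run_task() {
    task_id=$1
    script_dir=${2:-.}

    # An empty ID means the variable was not set for this task.
    if [ -z "$task_id" ]; then
        echo "LSF_PM_TASKID is not set; was this started through blaunch?" >&2
        return 2
    fi

    script="$script_dir/cmd_${task_id}.sh"

    # Refuse to continue if there is no executable script for this ID.
    if [ ! -x "$script" ]; then
        echo "no executable script for task $task_id: $script" >&2
        return 3
    fi

    # Run the per-task script; its exit status becomes ours.
    "$script"
}
```

To use it, cmd.sh would end with the line run_task "$LSF_PM_TASKID" after the function definition. That way a task without an ID, or with an ID that has no cmd_X.sh, reports the problem on stderr instead of trying to run ./cmd_.sh.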