Perl system(), exec() and interactions with LSF

I have a script that has to kick off 2 independent processes, and wait until one of them finishes before continuing.

Until now, I've created one process with fork (if the returned pid is 0, exec; otherwise wait) and the other with system and a command line.

Now I'm preparing to roll this script out to run 400 iterations of such paired processes on Platform Load Sharing Facility (LSF), but I'm concerned about stability. I know that the processes can crash. When that happens, I need a way to detect the crash and kill both the paired process and the main script.

Originally I wrote a watchdog with a 3-minute watch period: if 3 minutes of inactivity pass, it kills the processes. However, this produced a lot of false positives, because when LSF suspends one of the two processes, the watchdog sees it as inactive.

In LSF, I have the option to kill the jobs I submit. However, when I kill a job, what exactly do I kill? Will the kill take down the two processes the Perl script has created, or leave them running as zombies?

To reiterate:

Will killing a job on the LSF queue also kill every process that job has created?
What's the best (safest?) way to start two independent processes from a Perl script and wait until one of them exits before continuing?
How can I write a watchdog that can distinguish between a process that has crashed and a process that has been suspended by the LSF admin?

Solution

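The monitor should be the one that creates the child processes. (It can launch the "main script" too.) wait will tell you when they crash. Because wait returns only when a child has terminated, a child that has merely been suspended is not reported, so suspension cannot be mistaken for a crash.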

my %children;

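my $pid1 = fork();
if (!defined($pid1)) { ... }   # fork failed
if ($pid1 == 0) { ... }        # Child: exec the first command here
++$children{$pid1};

my $pid2 = fork();
if (!defined($pid2)) { ... }   # fork failed
if ($pid2 == 0) { ... }        # Child: exec the second command here
++$children{$pid2};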

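while (keys(%children)) {
   my $pid = wait();
   last if $pid == -1;         # No child processes left to wait for
   next if !$children{$pid};   # Not one of the children tracked here

   delete($children{$pid});

   if ($? & 0x7F) { ... }      # Killed by a signal
   elsif ($? >> 8) { ... }     # Exited with a nonzero status
}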
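
The question also asks how to continue as soon as one of the two processes exits and how to take down its pair. A minimal sketch of that variant follows, assuming the script has no other children. The names describe_status, spawn and run_pair are hypothetical helpers, not part of the code above, and sending TERM followed by CONT to the survivor is one possible cleanup policy rather than anything LSF requires.

```perl
use strict;
use warnings;
use POSIX ();

# Turn a wait status ($?) into a short description.
sub describe_status {
    my ($status) = @_;
    my $signal = $status & 0x7F;          # low 7 bits: terminating signal
    return "killed by signal $signal" if $signal;
    my $code = $status >> 8;              # high byte: exit code
    return $code ? "exited with error code $code" : "exited normally";
}

# Fork and exec one command; returns the child's pid to the parent.
sub spawn {
    my (@cmd) = @_;
    my $pid = fork();
    die "fork failed: $!" if !defined $pid;
    if ($pid == 0) {
        # Child: replace this process with the command.
        exec(@cmd) or POSIX::_exit(127);  # reached only if exec failed
    }
    return $pid;
}

# Start two commands (array refs), wait for whichever terminates first,
# then terminate and reap the other one.
sub run_pair {
    my ($cmd1, $cmd2) = @_;
    my %children;
    $children{ spawn(@$_) } = 1 for ($cmd1, $cmd2);

    my $first  = wait();                  # blocks until one child terminates
    my $status = $?;
    delete $children{$first};

    for my $pid (keys %children) {
        kill 'TERM', $pid;                # ask the survivor to stop
        kill 'CONT', $pid;                # wake it if it was suspended
        waitpid($pid, 0);                 # reap it so no zombie is left
    }
    return ($first, $status);
}
```

For example, run_pair(['sleep', '1'], ['sleep', '30']) should return after roughly one second with the pid and status of the first sleep, and describe_status can then be applied to that status to decide whether the main script should abort.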