
Perl system(), exec() and interactions with LSF


I have a script that has to kick off 2 independent processes, and wait until one of them finishes before continuing.

Up to now, I've run it by creating one process with fork: if the returned pid is 0, the child calls exec; otherwise the parent waits. The other process is created with system and a command line.

Now I'm preparing to roll this script out to run 400 iterations of such work-pair processes on Platform Load Sharing Facility (LSF), but I'm concerned about stability. I know that the processes can crash. When that happens, I need a way to detect the crash and kill both the paired process and the main script.

Originally I wrote a watchdog with a 3-minute watch period: if 3 minutes of inactivity pass, it kills the processes. However, this produced a lot of false positives, because when LSF suspends one of the two processes, the watchdog sees it as inactive.

In LSF, when I issue the jobs, I have the option to kill them. However, when I kill a job, what exactly do I kill? Will the kill take down the two processes the Perl script has created, or leave them running as zombies?

To reiterate,

  • Will killing a job on the LSF queue also kill every process that job has created?

  • What's the best (safest?) way to start two independent processes from a Perl script, and to wait until one of them exits before continuing?

  • How can I write a watchdog that can distinguish between a process that has crashed and a process that has been suspended by the LSF admin?


Solution

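  • The monitor should be the process that creates the child processes. (It can launch the "main script" as well.) As their parent, it can call wait, which returns as soon as a child terminates and sets $? so that you can tell a crash from a clean exit.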

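    my %children;
    
    my $pid1 = fork();
    if (!defined($pid1)) { ... }   # fork failed
    if (!$pid1) { ... }            # In the child: exec the first command
    ++$children{$pid1};            # In the parent: remember the child
    
    my $pid2 = fork();
    if (!defined($pid2)) { ... }   # fork failed
    if (!$pid2) { ... }            # In the child: exec the second command
    ++$children{$pid2};
    
    while (keys(%children)) {
       my $pid = wait();
       last if $pid == -1;         # No child processes left
       next if !$children{$pid};   # Not one of the children tracked here
    
       delete($children{$pid});
    
       if ($? & 0x7F) { ... }   # Killed by a signal
       if ($? >> 8) { ... }     # Exited with a non-zero status
    }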
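
    The outline above can be fleshed out into something runnable. What follows is only a minimal sketch under a few assumptions: the helper names spawn and wait_for_first are made up for illustration, the two $^X -e commands stand in for the real work-pair commands, and the surviving child is assumed to exit on SIGTERM. It starts both children, blocks until the first one terminates, then kills and reaps the other, which is the "wait until one of them finishes" behaviour the question asks for.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: fork a child that replaces itself with @cmd.
# Returns the child's PID in the parent; never returns in the child.
sub spawn {
    my @cmd = @_;
    my $pid = fork();
    die "fork failed: $!" if !defined $pid;
    if ($pid == 0) {
        # exec only returns on failure; leave without running END blocks.
        exec { $cmd[0] } @cmd or do {
            warn "exec '$cmd[0]' failed: $!\n";
            POSIX::_exit(127);
        };
    }
    return $pid;
}

# Hypothetical helper: block until one of the given children terminates,
# then send SIGTERM to the remaining ones and reap them.
# Returns (pid of the first child to end, signal number, exit code).
sub wait_for_first {
    my %children = map { $_ => 1 } @_;
    my ($first, $status);
    while (1) {
        my $pid = wait();
        die "no child processes left to wait for" if $pid == -1;
        next if !$children{$pid};      # not one of the tracked children
        ($first, $status) = ($pid, $?);
        delete $children{$pid};
        last;
    }
    for my $other (keys %children) {
        kill 'TERM', $other;           # assumes the child honours SIGTERM
        waitpid($other, 0);            # reap it so no zombie is left behind
    }
    return ($first, $status & 0x7F, $status >> 8);
}

# Stand-in commands: the first exits with status 3, the second keeps running.
my $pid1 = spawn($^X, '-e', 'exit 3');
my $pid2 = spawn($^X, '-e', 'sleep 30');

my ($first, $signal, $code) = wait_for_first($pid1, $pid2);
print "child $first ended: signal=$signal exit=$code\n";
```

    A non-zero signal means the child was killed (a crash such as a segmentation fault shows up here), while a non-zero exit code means it terminated on its own with an error. If a child might ignore SIGTERM, the waitpid call would block, so a real monitor would probably want a timeout followed by SIGKILL.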