I have a Perl script (snippet below) that runs from cron to perform system checks. I fork a child to act as a timeout and reap it with a $SIG{CHLD} handler. The Perl script makes several system calls to Bash scripts and checks their exit status. One Bash script fails about 5% of the time with no error: the Bash script exits with 0, but Perl sees $? as -1 and $! as "No child processes".
This Bash script tests compiler licenses, and Intel icc is left around after the Bash script completes (ps output below). I think the icc zombie completes, forcing Perl into the $SIG{CHLD} handler, which blows away the $? status before I'm able to read it.
Compile status -1; No child processes
#!/usr/bin/perl
use strict;
use POSIX ':sys_wait_h';
my $GLOBAL_TIMEOUT = 1200;
### Timer to notify if this program hangs
my $timer_pid;
$SIG{CHLD} = sub {
    local ($!, $?);
    while((my $pid = waitpid(-1, WNOHANG)) > 0)
    {
        if($pid == $timer_pid)
        {
            die "Timeout\n";
        }
    }
};
die "Unable to fork\n" unless(defined($timer_pid = fork));
if($timer_pid == 0) # child
{
    sleep($GLOBAL_TIMEOUT);
    exit;
}
### End Timer
### Compile test
my @compile = `./compile_test.sh 2>&1`;
my $status = $?;
print "Compile status $status; $!\n";
if($status != 0)
{
    print "@compile\n";
}
END # Timer cleanup
{
    if($timer_pid != 0)
    {
        $SIG{CHLD} = 'IGNORE';
        kill(15, $timer_pid);
    }
}
exit(0);
#!/bin/sh
cc compile_test.c
if [ $? -ne 0 ]; then
    echo "Cray compiler failure"
    exit 1
fi
module swap PrgEnv-cray PrgEnv-intel
cc compile_test.c
if [ $? -ne 0 ]; then
    echo "Intel compiler failure"
    exit 1
fi
wait
ps
exit 0
The wait doesn't really wait, because cc calls icc, which creates a zombie grandchild process that wait (or wait PID) doesn't block on. (wait `pidof icc`, PID 31589 in this case, gives "not a child of this shell".)
user 31589 1 0 12:47 pts/15 00:00:00 icc
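To make the behaviour of wait concrete, here is a minimal sketch that is not taken from the real scripts: sleep stands in for icc, and the names grandchild_pid, wait_status and tries are made up for illustration. It assumes a POSIX sh, where wait on a PID that is not a child of the current shell returns 127. Polling with kill -0 is shown only as one possible way to block until such a process is gone, and the 10-second cap is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical stand-in for "cc calls icc": an inner shell backgrounds
# a process, prints its PID and exits, so the process is orphaned.
grandchild_pid=$(sh -c 'sleep 1 >/dev/null 2>&1 & echo $!')

# A bare wait returns immediately: this shell has no direct children.
wait

# Waiting on a PID that is not our child fails (127 in POSIX sh).
wait_status=0
wait "$grandchild_pid" 2>/dev/null || wait_status=$?
echo "wait status: $wait_status"

# Poll instead of wait: kill -0 only tests whether the PID still exists.
# The loop is capped so it ends even if the PID lingers (e.g. as a zombie).
tries=0
while kill -0 "$grandchild_pid" 2>/dev/null && [ "$tries" -lt 10 ]; do
    sleep 1
    tries=$((tries + 1))
done
echo "polled $tries time(s)"
```

Something similar might be usable in compile_test.sh in place of the bare wait, with the PID taken from pidof icc; that is untested against the real compiler wrapper.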
I just don't know how to fix this in Bash or Perl.
Thanks, Chris
I thought the quickest solution would be to add a sleep of a second or two at the bottom of the Bash script to wait for the zombie icc to complete. But that didn't work.
If I didn't already have a SIGALRM handler (in the real program), I agree the best choice would be to wrap the whole thing in an eval, even though that would be pretty ugly for a 500-line program.
Without the local($?), every `system` call gets $? = -1. The $? I need in this case is the one set by waitpid inside the handler; unfortunately $? is then set to -1 after the signal handler exits. So I found that saving it in the handler works. New lines are marked with ###
my $timer_pid;
my $chld_status; ###
$SIG{CHLD} = sub {
    local($!, $?);
    while((my $pid = waitpid(-1, WNOHANG)) > 0)
    {
        $chld_status = $?; ###
        if($pid == $timer_pid)
        {
            die "Timeout\n";
        }
    }
};
...
my @compile = `./compile_test.sh 2>&1`;
my $status = ($? == -1) ? $chld_status : $?; ###
...