How to correctly count the actual number of forked child processes?

Some time ago I wrote a simple SMTP gate for automatic S/MIME processing, and now it is time to test it. As is typical for mail servers, the main process forks a child for every incoming connection. It is good practice to limit the number of child processes, so I did.

Under heavy load (many connections from many clients at the same time), the child processes are not counted correctly: the problem is in decrementing the counter when children exit. After a few minutes of heavy load the counter is greater than the actual number of child processes (e.g. after 5 minutes it equals 14, but there are none).

I have already done some research, but nothing worked. All zombie processes are reaped, so the SIGCHLD handling seems to be OK. I thought it might be a synchronization problem, but adding a mutex and changing the variable type to volatile sig_atomic_t (as it is now) made no difference. It is also not a problem with signal masking: I tried masking all signals using sigfillset(&act.sa_mask).

I noticed that waitpid() sometimes returns strange PID values (very large, like 172915914).

Questions and some code.

Is it possible that another process (e.g. init) is reaping some of them?
Can a process not become a zombie after exit? Can it be reaped automatically?
How to fix it? Maybe there is a better way of counting them?

Forking a child in main():

volatile sig_atomic_t sproc_counter = 0;    /* forked subprocesses counter */

/* S/MIME Gate main function */
int main (int argc, char **argv)
{
    [...]

    /* set appropriate handler for SIGCHLD */
    Signal(SIGCHLD, sig_chld);

    [...]

    /* SMTP Server's main loop */
    for (;;) {

        [...]

        /* check whether subprocesses limit is not exceeded  */
        if (sproc_counter < MAXSUBPROC) {
            if ( (childpid = Fork()) == 0) {    /* child process */
                Close(listenfd);                /* close listening socket */
                smime_gate_service(connfd);     /* process the request */
                exit(0);
            }
            ++sproc_counter;
        }
        else
            err_msg("subprocesses limit exceeded, connection refused");

        [...]
    }
    Close(connfd);  /* parent closes connected socket */
}

Signal handling:

Sigfunc *signal (int signo, Sigfunc *func)
{
    struct sigaction    act, oact;

    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;

    if (signo == SIGALRM) {
#ifdef  SA_INTERRUPT
        act.sa_flags |= SA_INTERRUPT;   /* SunOS 4.x */
#endif
    }
    else {
#ifdef  SA_RESTART
        act.sa_flags |= SA_RESTART;     /* SVR4, 44BSD */
#endif
    }
    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;

    return oact.sa_handler;
}

Sigfunc *Signal (int signo, Sigfunc *func)
{
    Sigfunc *sigfunc;

    if ( (sigfunc = signal(signo, func)) == SIG_ERR)
        err_sys("signal error");
    return sigfunc;
}

void sig_chld (int signo __attribute__((__unused__)))
{
    pid_t pid;
    int stat;

    while ( (pid = waitpid(-1, &stat, WNOHANG)) > 0) {
        --sproc_counter;
        err_msg("child %d terminated", pid);
    }
    return;
}

NOTE: All functions beginning with a capital letter (like Fork(), Close(), Signal() etc.) behave the same as their lowercase counterparts (fork(), close(), signal() etc.), but have better error handling, so I don't have to check their return statuses.

NOTE2: I compile and run it under Debian Testing (kernel v3.10.11) using gcc 4.8.2.

Solution

I think the signal method can be fixed, while creating a thread forces you to exec a program to handle a connection.

There are several problems:

1. Changes to sproc_counter may be lost if one child is created while another terminates: ++sproc_counter in the main flow is a read-modify-write sequence that is not guaranteed to be atomic, so a decrement made by the handler in the middle of it can be overwritten. To fix this, either use signal masks (e.g. sigprocmask(), pselect()) to ensure the handler is not invoked while the main flow is manipulating sproc_counter, or make the signal handler only set a flag and perform the waitpid(), counter manipulation and logging in the main flow (but not in a new thread). Note that the flag method still requires signal mask manipulation: otherwise a SIGCHLD that arrives just before the main flow blocks goes unnoticed until the next connection or the next child exit wakes it up.
2. err_msg() is probably not async-signal-safe. I see three options:
   - use the flag method mentioned above, or
   - ensure no async-signal-unsafe functions are called while SIGCHLD is unmasked, or
   - remove the call from the signal handler.
3. Overriding signal() may cause other code to invoke your version instead of the standard version. This is likely to lead to strange behaviour.
4. The signal handler does not save and restore the value of errno. The final waitpid() call in the loop can fail with ECHILD and so change errno under the interrupted code.
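
A minimal sketch of the signal-mask variant, addressing the problems listed above. It is not the asker's code: install_sigchld_handler(), spawn_counted() and demo_work() are hypothetical names introduced here, and demo_work() merely stands in for smime_gate_service(). The handler is installed with sigaction() directly rather than through an overridden signal(), it does no logging, and it preserves errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t sproc_counter = 0;  /* children not yet reaped */

/* SIGCHLD handler: reaps every child that has exited, calls only
 * async-signal-safe functions, and leaves errno as it found it. */
static void sig_chld(int signo)
{
    int saved_errno = errno;

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        --sproc_counter;
    errno = saved_errno;
}

/* Installs the handler without redefining signal(). Returns 0 on success. */
int install_sigchld_handler(void)
{
    struct sigaction act;

    act.sa_handler = sig_chld;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &act, NULL);
}

/* Hypothetical stand-in for smime_gate_service(): does nothing. */
void demo_work(int arg)
{
    (void)arg;
}

/* Forks a child that runs work(arg) and exits. SIGCHLD is blocked from
 * before fork() until after the increment, so the handler cannot run in
 * the middle of the counter update. Returns the child's PID, or -1. */
pid_t spawn_counted(void (*work)(int), int arg)
{
    sigset_t block, old;
    pid_t pid;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid = fork();
    if (pid == 0) {                       /* child */
        sigprocmask(SIG_SETMASK, &old, NULL);
        work(arg);
        _exit(0);
    }
    if (pid > 0)                          /* parent, fork succeeded */
        ++sproc_counter;

    /* Any SIGCHLD that arrived meanwhile is delivered here. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}
```

In the main loop this would replace the Fork() / ++sproc_counter pair, e.g. spawn_counted(demo_work, connfd) after the sproc_counter < MAXSUBPROC check. Logging of terminated children would then have to happen in the main flow, which is where the flag method becomes useful.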

If you have problems because of signals interrupting other signals, that's what sigaction's sa_mask field is for.
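
As an illustration of sa_mask, here is a small sketch under the assumption that you want certain signals held pending while one particular handler runs. install_with_mask() and on_signal() are hypothetical names, not part of the original code or of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical placeholder handler; a real one would do its
 * async-signal-safe work here. */
static void on_signal(int signo)
{
    (void)signo;
}

/* Installs handler for signo. Every signal listed in blocked[] is added
 * to sa_mask, so it stays pending while the handler is executing and is
 * delivered only after the handler returns. Returns 0 on success. */
int install_with_mask(int signo, void (*handler)(int),
                      const int *blocked, size_t nblocked)
{
    struct sigaction act;
    size_t i;

    act.sa_handler = handler;
    act.sa_flags = SA_RESTART;
    sigemptyset(&act.sa_mask);
    for (i = 0; i < nblocked; i++)
        if (sigaddset(&act.sa_mask, blocked[i]) < 0)
            return -1;                    /* invalid signal number */
    return sigaction(signo, &act, NULL);
}
```

For example, install_with_mask(SIGCHLD, sig_chld, (const int[]){ SIGALRM }, 1) would keep SIGALRM pending while the SIGCHLD handler runs. Note that sa_mask only protects handlers from each other; it does nothing about the handler interrupting the main flow, which is what sigprocmask() is for.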