Tags: fork, ipc, multiprocess, signal-handling, resource-cleanup

Handling SIGTERM in Parent Process on Fork Failure - Resource Cleanup Issue


I'm encountering an issue with my multi-process application, in which the main process creates child processes and each child forks further. When the system's process limit is reached and fork() fails, the child sends SIGTERM to the main process. The main process catches this signal and cleans up its resources (signals, shared memory, message queue).

The signal-based cleanup works correctly in other parts of my project, for example when the while loop in the code below finishes normally.

Sample Default termination:

void run_simulation()
{

    while (i <= 10)
    {
        /* Does operations */
        i++;
    }

    kill(getpid(), SIGTERM);
}

Cleanup at SIGTERM:

/* Clean up resources and exit gracefully */
void clean_and_exit()
{   
    /* Terminate child processes */
    system(GRACEFUL_SHUTDOWN_SCRIPT);

    /* Removes resources*/
    clean_resources();
}

However, I'm facing unexpected errors when using the signal in the case of a failed fork. It seems that the resources have already been released even before sending the signal. Is there any automatic cleanup for a failed fork, or is there an issue with my cleanup process?

Error Messages:

[ERROR] ipcs/sem.c Line: 50 PID = 7022 Error 43 (Identifier removed)

[ERROR] ipcs/msg.c Line: 52 PID = 7256 Error 22 (Invalid argument)

[ERROR] ipcs/msg.c Line: 52 PID = 6981 Error 22 (Invalid argument)

...

Child Fork Failure:

pid = fork();
if (pid == -1)
{
    kill(getppid(), SIGTERM);
}
else if (pid == 0)
{
    /* Child operations*/
}

I'm seeking insights on why the cleanup process after a failed fork might be causing these errors and how to effectively troubleshoot this issue. Additionally, I'm open to suggestions on whether using a message queue instead of signals for communication might be a more robust approach.


Solution

  • The parent's cleanup handler calls system(), which internally forks a shell to run the script (GRACEFUL_SHUTDOWN_SCRIPT). The script in turn likely forks further child processes to run the commands inside it. None of this can work once the maximum number of processes has been reached, so checking the return status of system() would help.

    I therefore suspect that the graceful shutdown script never runs and, consequently, the running child processes are not killed.

    Moreover, the parent must wait for the child processes to terminate, using wait(), before removing the IPC objects; otherwise they may be removed while children are still using them. For example, the "Identifier removed" error likely comes from a child that is not dead yet and is using an IPC object that the parent has already removed. The "Invalid argument" errors are consistent with the same cause, since a call made with an identifier that no longer exists typically fails with EINVAL.
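The two suggestions above (check the status of system(), and wait for the children before removing the IPC objects) could be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the helper names run_shutdown_script() and wait_for_all_children() are hypothetical, and it assumes the cleanup runs in normal program context rather than directly inside the signal handler, since system() and fprintf() are not async-signal-safe.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: run the shutdown script and report whether it worked.
 * Returns 0 if the script ran and exited with status 0, -1 otherwise. */
int run_shutdown_script(const char *cmd)
{
    int status = system(cmd);

    if (status == -1) {
        /* system() could not create or reap the shell process,
         * which is what happens when the process limit is reached. */
        perror("system");
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        /* The shell ran but the script failed or was killed by a signal. */
        fprintf(stderr, "shutdown script failed (raw status %d)\n", status);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: block until every child of the caller has terminated.
 * Returns the number of children reaped, or -1 on an unexpected error. */
int wait_for_all_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = wait(NULL);

        if (pid > 0) {
            reaped++;           /* one child has terminated */
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        if (errno == ECHILD)
            return reaped;      /* no children left */
        return -1;              /* unexpected failure */
    }
}
```

In the parent's cleanup path, the intended order would be to call run_shutdown_script(GRACEFUL_SHUTDOWN_SCRIPT) first, then wait_for_all_children(), and only then clean_resources(). If the script fails, the children are still running and wait_for_all_children() would block, so in that case the parent would need another way to stop them (for instance, sending SIGTERM to child PIDs it has recorded) before waiting.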