Search code examples
clinuxshellunixjob-control

Background and suspended processes - Implementing a Job Control Shell in C


I'm implementing a Job Control Shell in C in Linux as a project for a Operating Systems-related subject. I have a main() function that does child process management, helped with a linked list as shown here in which background and suspended jobs information is stored:

typedef struct job_
{
    pid_t pgid; /* group id = process lider id */
    char * command; /* program name */
    enum job_state state;
    struct job_ *next; /* next job in the list */
} job;

Every time a child process exits or is stopped, a SIGCHLD is sent to parent process to be informed about that. Then, I have a signal handler as shown here that for each node of that job status linked list, checks if the process represented in that node has exited and if it did, that node is removed from the linked list. Here is the code for the SIGCHLD handler, where 'job_list' is the linked list where the info is stored:

void mySIGCHLD_Handler(int signum) {
    block_SIGCHLD();
    if (signum == 17) {
        job *current_node = job_list->next, *node_to_delete = NULL;
        int process_status, process_id_deleted;

        while (current_node) {

            /* Wait for a child process to finish.
            *    - WNOHANG: return immediately if the process has not exited
            */
            waitpid(current_node->pgid, &process_status, WNOHANG);

            if (WIFEXITED(process_status) != 0) {
                node_to_delete = current_node;
                current_node = current_node->next;
                process_id_deleted = node_to_delete->pgid;
                if (delete_job(job_list, node_to_delete)) {
                printf("Process #%d deleted from job list\n", process_id_deleted);
                } else {
                    printf("Process #%d could not be deleted from job list\n", process_id_deleted);
                }
            } else {
                current_node = current_node->next;
            }
        }
    }
    unblock_SIGCHLD();
}

The thing is, when the handler is called, some entries that should not be deleted because the process they represent are not exited, are deleted, when they shouldn't. Anyone would know why that happens?

Thank you and sorry for your lost time :(


Solution

  • I see many problems in this code, but the immediate issue is probably here:

            waitpid(current_node->pgid, &process_status, WNOHANG);
            if (WIFEXITED(process_status) != 0) {
    

    When waitpid(pid, &status, WNOHANG) returns because the process has not exited, it does not write anything to status, so the subsequent if is branching on garbage. You need to check the actual return value of waitpid before assuming status is meaningful.

    The most important other problems are:

    • The kernel is allowed to send only one SIGCHLD to tell you that several processes have exited. When you get a SIGCHLD, you need to call waitpid(0, &status, WNOHANG) in a loop until it tells you there are no more processes to wait for, and you need to process (no pun intended) all of the exited process IDs that it tells you about.

    • It is not safe to call printf or free from an asynchronous signal handler. Add terminated processes to a list of deferred tasks, instead. Make sure to block SIGCHLD in the main-loop code that consumes that list.

    • Don't block and unblock SIGCHLD yourself in the handler; that has an unavoidable race condition. Instead, let the kernel do it for you, atomically, by setting up your signal handler correctly: use sigaction and don't put SA_NODEFER in sa_flags. (Do put SA_RESTART in sa_flags, unless you have a very good reason not to.)

    • The literal number 17 should be the signal constant SIGCHLD instead. Some signal numbers have been stable across all Unixes throughout history, but SIGCHLD is not one of them.