c fork zombie-process waitpid posix-select

Cleaning up child processes with a SIGCHLD handler (waitpid, pselect, fork, sigaction)

I have a server that receives messages on a socket and, for each message received, does a fork/exec. This part seems to work properly.

But I need to do this in non-blocking mode, so I've created a handler that cleans up all the terminated child processes with waitpid() (as explained in many forum threads).

The problem is that this handler causes my pselect() call to fail with an interrupted system call, and the program stops with the following message:
"select(): Interrupted system call"

I found some explanations of this problem in forums ("Preventing race conditions" and so on), so I tried using sigprocmask() to block several signals, but it didn't work.

I'm sure this is a trivial problem, but this is my first time writing this kind of program.

I would appreciate some help. Thanks in advance.

Here is the program:

void
clean_up_child_process (int signal_number)
{

  pid_t p;
  int status;

  while (1)
    {
      p = waitpid (-1, &status, WNOHANG);

      if (p == -1)
        {
          if (errno == EINTR)
            {
              continue;
            }
          break;
        }
      else if (p == 0)
        {
          break;
        }
    }


}

static void
app (void)
{
  SOCKET sock;
  char commande[BUF_SIZE];
  char res_cmd[BUF_SIZE];
  int max;
  int n;

  sock = init_connection ();
  max = sock;
  fd_set rdfs;

  sigemptyset (&sigmask);
  sigaddset (&sigmask, SIGCHLD);
  sigaddset (&sigmask, SIGINT);
  sigaddset (&sigmask, SIGTSTP);
  sigaddset (&sigmask, SIGTERM);
  sigprocmask (SIG_BLOCK, &sigmask, NULL);

  struct sigaction sigchld_action;
  memset (&sigchld_action, 0, sizeof (sigchld_action));
  sigchld_action.sa_handler = &clean_up_child_process;
  sigaction (SIGCHLD, &sigchld_action, NULL);

  while (1)
    {
      int i = 0;
      FD_ZERO (&rdfs);

      /* add STDIN_FILENO */
      FD_SET (STDIN_FILENO, &rdfs);

      /* add the connection socket */
      FD_SET (sock, &rdfs);

      sigemptyset (&empty_mask);
      if (pselect (max + 1, &rdfs, NULL, NULL, NULL, &empty_mask) == -1)
        if (errno != EINTR)
          {
            perror ("select()");
            exit (errno);
          }

      if (FD_ISSET (STDIN_FILENO, &rdfs))
        {
          /* stop process when type on keyboard */
          // break; must be disable to avoid bad exits
        }
      else if (FD_ISSET (sock, &rdfs))
        {
          /* new client */
          SOCKADDR_IN csin = { 0 };
          socklen_t sinsize = sizeof csin;
          int csock = accept (sock, (SOCKADDR *) &csin, &sinsize);
          if (csock == SOCKET_ERROR)
            {
              perror ("accept()");
              continue;
            }

          if ((n = recv (csock, commande, BUF_SIZE - 1, 0)) < 0)
            {
              perror ("recv(commande)");
              n = 0;
              continue;
            }
          commande[n] = 0;
          if ((n = fork ()) == -1)
            perror ("fork()");
          else if (n == 0)
            {
              close (STDOUT_FILENO);
              dup (csock);
              close (STDERR_FILENO);
              dup (csock);
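              execlp (commande, commande, (char *) NULL);
              /* only reached if exec failed: do not fall back into the server loop */
              perror ("execlp()");
              _exit (127);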
            }
          else
            {
              closesocket (csock);
            }
        }
    }
  end_connection (sock);
}

Solution

You need to learn a little more about POSIX signal handling.

When a signal is received during an interruptible system call (in this instance pselect), the system call exits back to userspace and the signal handler is invoked. After the signal handler completes, the normal behaviour is that the system call fails with errno set to EINTR. For many system calls it is possible to avoid this by setting the SA_RESTART flag in the signal action, in which case the kernel automatically restarts the system call. (On Linux, select() and pselect() are never restarted after a handler runs, regardless of SA_RESTART; see signal(7).) Even where it applies, SA_RESTART sounds like a great option until you realise that you often want to trap signals like SIGINT, have them set a global variable (e.g. to quit the program), and test for that. Hence constructs like the following (adapted for your program) are common:

volatile sig_atomic_t rxsig_quit = 0;

void
handlesignal (int sig)
{
  /* Only do async-signal-safe things here; remember mutexes may be held */
  int saved_errno = errno;      /* waitpid() may overwrite errno */

  switch (sig)
    {
    case SIGINT:
    case SIGTERM:
      rxsig_quit = 1;
      break;
    case SIGCHLD:
      /* do all our waiting here: reap every child that has already exited */
      while (waitpid (-1, NULL, WNOHANG) > 0)
        ;
      break;
    }

  errno = saved_errno;
}

static void
app (void)
{

  /* ... */

  while (!rxsig_quit)
    {
      /* ... */

      int ret;                  /* declared outside the loop so the condition can see it */

      do
        {
          ret = pselect (max + 1, &rdfs, NULL, NULL, NULL, &empty_mask);
        }
      while ((ret < 0) && (errno == EINTR) && !rxsig_quit);

      /* ... */
    }

  /* ... */
}

You can get more information using man -s7 signal. This also lists the async-signal-safe functions, i.e. the functions you can safely call in a signal handler (recent Linux man-pages releases keep that list in signal-safety(7) instead).

You are, however, assuming that you need to do the wait at all. On modern POSIX systems, this is not the case. You can set SIGCHLD to SIG_IGN, in which case the OS will do the work, as per this paragraph from the manpage for wait(2):

POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.) Linux 2.6 conforms to this specification. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.

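Obviously this is less portable: systems that predate POSIX.1-2001 (including Linux 2.4 and earlier, as noted above) do not guarantee this behaviour.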