Figuring out what startpar.c (sysvinit) is doing

OK here's a long one, brace yourself! :)

Recently I tried launching a watchdog script written in bash, during boot. So I added a line to rc.local containing the following:

su someuser -c "/home/someuser/watchdog.sh &"

The watchdog.sh script looks like this:

#!/bin/bash
until /home/someuser/eventMonitoring.py
do
    sleep 1
done

All is fine, all is good, the script gets started and all. However, a new process appears in the process list and stays there forever:

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root      3048     1  0  1024   620   1 20:04 ?        00:00:00 startpar -f -- rc.local

Now, my script (watchdog.sh) got launched and was successfully detached, since its PPID is also 1. I was then on a mission to find out what that process is. Startpar is part of the sysvinit boot system (http://savannah.nongnu.org/projects/sysvinit). I'm currently on Debian Wheezy 7.4.0, which uses that system. Now man startpar says:

startpar is used to run multiple run-level scripts in parallel.

By trial and error I figured out how to launch my script during boot without leaving startpar hanging. All file descriptors of the process need to be redirected to a file or to /dev/null, or closed altogether. Which, when you think about it, is a rational thing to do. I finally did it like this:

su someuser -c "some_script.sh >/dev/null 2>&1 &"

That resolved the issue, but it still left me wondering why startpar behaves like it does. Is it a bug or is it a feature?

So I dived a bit into the code (http://svn.savannah.nongnu.org/viewvc/startpar/trunk/startpar.c?root=sysvinit&view=markup) and started going from the end to the beginning:

First I located where that startpar -f -- rc.local call is made:
line 741:

execlp(myname, myname, "-f", "--", p->name, NULL);

OK, so this will actually start a new startpar process which will replace the current running instance. It's basically a recursive call on itself. Let's look at what that -f parameter does:

line 866:

case 'f':
      forw = 1;
      break;

OK, let's see what setting the forw variable to 1 does...
line 900:

if (forw)
    do_forward();

And finally let's see what's up with that function:

line 615:

void do_forward(void)
{
  char buf[4096], *b;
  ssize_t r, rr;
  setsid();
  while ((r = read(0, buf, sizeof(buf))))
    {
      if (r < 0)
        {
          if (errno == EINTR)
            continue;
#if defined(DEBUG) && (DEBUG > 0)
          perror("\n\rstartpar: forward read");
#endif
          break;
        }
      b = buf;
      while (r > 0)
        {
          rr = write(1, b, r);
          if (rr < 0)
            {
              if (errno == EINTR)
                continue;
              perror("\n\rstartpar: forward write");
              rr = r;
            }
          r -= rr;
          b += rr;
        }
    }
  _exit(0);
}

As far as I understand it, this copies everything that is read from file descriptor 0 to file descriptor 1. Now let's see what is really linked to those file descriptors:

root@server:~# ls -al /proc/3048/fd
total 0
dr-x------ 2 root root  0 Apr  2 21:13 .
dr-xr-xr-x 8 root root  0 Apr  2 21:13 ..
lrwx------ 1 root root 64 Apr  2 21:13 0 -> /dev/ptmx
lrwx------ 1 root root 64 Apr  2 21:13 1 -> /dev/console
lrwx------ 1 root root 64 Apr  2 21:13 2 -> /dev/console

Hmm, interesting... So ptmx is, according to man:

The file /dev/ptmx is a character file with major number 5 
and minor number 2, usually of mode 0666 and owner.group of root.root. 
It is used to create a pseudoterminal master and slave pair.

and console:

The current console is also addressed by
/dev/console or /dev/tty0, the character device with major number 4
and minor number 0.

And at that point I came here to Stack Overflow. Now, can someone tell me what is going on here? Did I get this right, that startpar is left in a state of constantly forwarding whatever arrives on ptmx to the console? Why is it doing that? Why ptmx? Is this a bug?

Solution

TL;DR

This is definitely NOT a bug in startpar, which is doing exactly what its man page promises:

The output of each script is buffered and written when the script exits, so output lines of different scripts won't mix. You can modify this behaviour by setting a timeout.

Code details

Within the run() function in startpar.c,

Line 422: Obtain a handle to the master pseudoterminal (/dev/ptmx in this case).

p->fd = getpt();

Line 429: Obtain the path of the corresponding slave pseudoterminal.

else if ((m = ptsname(p->fd)) == 0 || grantpt(p->fd) || unlockpt(p->fd))

Line 438: Fork a child process.

if ((p->pid = fork()) == (pid_t)-1)

Line 475: In the child, close the inherited stdout.

TEMP_FAILURE_RETRY(close(1));

Line 476: Open the slave pseudoterminal. Because fd 1 was just closed, open() is expected to return the lowest free descriptor, which is 1, so the child's stdout now goes to the slave pseudoterminal (and is received on the master side).

if (open(m, O_RDWR) != 1)

Line 481: Also capture stderr by making fd 2 a duplicate of the slave pseudoterminal fd.

TEMP_FAILURE_RETRY(dup2(1, 2));

Line 561: After some bookkeeping, launch the executable of interest (in the child process).

execlp(p->name, p->arg0, (char *)0);

The parent process can later capture all output and error messages of the newly launched process by reading from the master pseudoterminal, and write them to its own stdout (/dev/console in this case).
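
The reader on the master side can only stop once nothing holds the slave side open any more, and a job backgrounded by the script inherits that descriptor. The following is a minimal sketch of that effect, not startpar's code: it assumes a plain pipe behaves like the pseudoterminal pair for this purpose, uses cat as a stand-in for the forwarding process, and leaky_script is a hypothetical stand-in for an rc.local that backgrounds a job without redirecting its output.

```shell
#!/bin/sh
# Analogy only: a pipe stands in for startpar's pseudoterminal pair and
# `cat` stands in for the process that forwards the captured output.

# Hypothetical stand-in for an rc.local that backgrounds a job
# without redirecting its output.
leaky_script() {
    sleep 2 &            # inherits stdout, i.e. the write end of the pipe
    echo "script done"   # the script itself finishes immediately
}

start=$(date +%s)
# cat sees end-of-file only after every writer, including sleep, has exited.
leaky_output=$(leaky_script | cat)
leaky_elapsed=$(( $(date +%s) - start ))

echo "output: $leaky_output"
echo "reader waited about ${leaky_elapsed}s for a script that returned at once"
```

With a watchdog that never exits in place of the sleep, the reader would wait forever, which matches the lingering `startpar -f -- rc.local` process from the question.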

How to prevent a dangling `startpar -f ...` process on your system?

Method 1: Define the executable to be launched as interactive.

Explicitly marking an executable as interactive tells startpar to skip the pseudoterminal master/slave trickery used to buffer terminal I/O, since any output of an interactive executable needs to be displayed on screen immediately rather than buffered.

This modifies the flow of execution in several places, mainly at Line 1171, where startpar does NOT call the run() function for an interactive executable.

This has been tested and described here.
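
On Debian, one way to mark an init script as interactive is the X-Interactive tag in its LSB header, which insserv documents for scripts that should run alone under concurrent boot. The header below is a hypothetical sketch, not the actual contents of /etc/init.d/rc.local on Wheezy; the script name and dependencies are made up, and only the X-Interactive line is the point.

```shell
#!/bin/sh
### BEGIN INIT INFO
# Provides:          example-local
# Required-Start:    $all
# Required-Stop:
# Default-Start:     2 3 4 5
# Default-Stop:
# Short-Description: Hypothetical init script, for illustration only
# X-Interactive:     true
### END INIT INFO
```

The boot dependency information most likely has to be regenerated (for example by re-running insserv) before startpar picks up the change.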

Method 2: Discard `stdout` and `stderr` of the executable to be launched.

Using the construct ">/dev/null 2>&1 &" discards the stdout and stderr of the executable being launched. With both explicitly redirected to /dev/null, the background process no longer holds the slave pseudoterminal open, so startpar does NOT keep buffering its output indefinitely as it otherwise would.
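
A minimal sketch of why the redirection helps, under the same kind of assumption as any pipe-based analogy: a plain pipe stands in for the pseudoterminal pair, cat stands in for the forwarding process, and tidy_script is a hypothetical stand-in for an rc.local that applies the redirection.

```shell
#!/bin/sh
# Analogy only: a pipe stands in for startpar's pseudoterminal pair and
# `cat` stands in for the process that forwards the captured output.

# Hypothetical stand-in for an rc.local that applies the redirection.
tidy_script() {
    sleep 2 >/dev/null 2>&1 &   # no longer holds the write end of the pipe
    echo "script done"
}

start=$(date +%s)
tidy_output=$(tidy_script | cat)   # cat sees end-of-file right away
tidy_elapsed=$(( $(date +%s) - start ))

echo "output: $tidy_output"
echo "reader waited about ${tidy_elapsed}s"
```

The background job is still running when the reader finishes; it just no longer keeps the reader's input open.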

Method 3: Set an explicit timeout for `startpar`

Either set the buffer timeout (timo in startpar.c):

The timeout set with the -t option is used as buffer timeout. If the output buffer of a script is not empty and the last output was timeout seconds ago, startpar will flush the buffer.

or the global timeout (gtimo in startpar.c):

The -T option timeout works more globally. If no output is printed for more than global_timeout seconds, startpar will flush the buffer of the script with the oldest output. Afterwards it will only print output of this script until it is finished.