
Why are hanging SSH commands waiting for output from a pipe with both ends open in 'sshd' on the server?


This is on StackOverflow as opposed to SuperUser/ServerFault since it has to do with the syscalls and OS interactions being performed by sshd, not the problem I'm having using SSH (though assistance with that is appreciated, too :p).

Context:

I invoke a complex series of scripts via SSH, e.g. ssh user@host -- /my/command. The remote command does a lot of complex forking and execcing and eventually results in a backgrounded daemon process running on the remote host. Occasionally (I'm slowly going mad trying to find out reliable reproduction conditions), the ssh command will never return control to the client shell. In those situations, I can go onto the target host and see an sshd: user@notty process with no children hanging indefinitely.

Fixing that issue is not what this question is about. This question is about what that sshd process is doing.

The SSH implementation is OpenSSH, and the version is 5.3p1-112.el6_7.

The problem:

If I find one of those stuck sshds and strace it, I can see it's doing a select on two handles, e.g. select(12, [3 6], [], NULL, NULL) or similar. lsof tells me that one of those handles is the TCP socket connecting back to the SSH client. The other is a pipe, the other end of which is only open in the same sshd process. If I search for that pipe by ID using the answer to this SuperUser question, the only process that contains references to that pipe is the same process. lsof confirms this: both the read and write ends of the pipe are open in the same process, e.g. (for pipe 788422703 and sshd PID 22744):

sshd    22744 user    6r  FIFO                0,8      0t0 788422703 pipe
sshd    22744 user    7w  FIFO                0,8      0t0 788422703 pipe 
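
For anyone who wants to do the same check programmatically rather than with lsof, here is a minimal sketch. It is Linux-only and assumes /proc is mounted; the helper names fd_target and same_pipe are made up for illustration. It relies on the fact that /proc/self/fd/N for a pipe end is a symlink whose target looks like pipe:[788422703], so two descriptors with the same target are ends of (or duplicates within) the same pipe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the /proc symlink for one descriptor of the calling process.
 * On Linux a pipe end resolves to a string such as "pipe:[788422703]".
 * Returns 0 on success, -1 on failure. */
static int fd_target(int fd, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}

/* Returns 1 if both descriptors refer to the same pipe,
 * 0 if they do not, and -1 if /proc could not be read. */
static int same_pipe(int fd_a, int fd_b)
{
    char a[128], b[128];

    if (fd_target(fd_a, a, sizeof(a)) < 0 ||
        fd_target(fd_b, b, sizeof(b)) < 0)
        return -1;
    return strncmp(a, "pipe:[", 6) == 0 && strcmp(a, b) == 0;
}
```

To inspect another process (such as the stuck sshd), the same idea should work with /proc/PID/fd/N in place of /proc/self/fd/N, given sufficient privileges.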

Questions:

What is SSH waiting for? If the pipe isn't connected to anything and there are no child processes, I can't imagine what event it could be expecting.

What is that "looped" pipe/what does it represent? My only theory is that maybe if STDIN isn't supplied to the SSH client, the target host sshd opens a dummy STDIN pipe so some of its internal child-management code can be more uniform? But that seems pretty tenuous.

How does SSH get into this situation?

What I've Tried/Additional Info:

  • Initially, I thought this was a handle leak to a daemon. It's possible to create a waiting, child-less sshd process by issuing a command that backgrounds itself, e.g. ssh user@host -- 'sleep 60 &'. In that case, sshd waits for the streams connected to the daemonized process to be closed, not just for the exit of its immediate child. Since the scripts in question eventually result (way down the process tree) in a daemon being started, it initially seemed possible that the daemon was holding onto a handle. However, that doesn't seem to hold up--using the sleep 60 & command as an example, sshd processes communicating with daemons hold and select on four open pipes, not just two, and at least two of the pipes are connected from sshd to the daemon process, not looped. Unless there's a method of tracking/pointing to a pipe I don't know about (and there likely is--for example, I have no idea how duped filehandles play into close() semaphore waiting or piping), I don't think the pipe-to-self situation represents a waiting-on-daemon case.
  • sshd periodically receives communication on the TCP socket/ssh connection itself, which wakes it up out of the selects for a brief period of communication (during which strace shows it blocking SIGCHLD), and then it goes back to waiting on the same FDs.
  • It's possible that I'm being affected by this race condition (SIGCHLD getting delivered before the kernel makes data available in the pipe). However, that seems unlikely, both given the rate at which this condition manifests, and the fact that the processes being run on the target host are Perl scripts, and the Perl runtime flushes and closes open file descriptors on shutdown.
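
To illustrate the waiting-on-daemon case from the first bullet, here is a small sketch of the underlying pipe behavior. This is not sshd code; the function name eof_after_child_exit and the timings are arbitrary choices for the demo. A parent reads from a pipe, its direct child forks a background process and exits, and the parent then checks whether the pipe has reached EOF. The reader only sees EOF once every copy of the write end is closed, including the copy inherited by the background process.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Spawn a child that starts a background process (a grandchild that
 * sleeps for one second) and then exits right away.
 * Returns 1 if the parent sees hang-up/EOF on the pipe within 300 ms of
 * reaping the child, 0 if the pipe is still held open, -1 on error.
 */
static int eof_after_child_exit(int background_closes_pipe)
{
    int fds[2];
    int ready;
    pid_t child;
    struct pollfd pfd;

    if (pipe(fds) < 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        pid_t bg;

        close(fds[0]);
        bg = fork();
        if (bg == 0) {
            /* Background process: inherits the write end of the pipe. */
            if (background_closes_pipe)
                close(fds[1]);
            sleep(1);
            _exit(0);
        }
        _exit(0);       /* the direct child's copy of fds[1] closes here */
    }

    close(fds[1]);                  /* parent keeps only the read end */
    if (waitpid(child, NULL, 0) < 0)
        return -1;

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    ready = poll(&pfd, 1, 300);     /* wait up to 300 ms for EOF */
    close(fds[0]);
    if (ready < 0)
        return -1;
    return ready > 0 && (pfd.revents & POLLHUP) != 0;
}
```

With background_closes_pipe set to 0, the direct child has already been reaped but the pipe still has a writer, which matches the 'sleep 60 &' behavior described above. With it set to 1, EOF shows up as soon as the background process drops its copy.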

Solution

  • It seems that you're describing the notify pipe. The OpenSSH sshd main loop calls select() to wait until it has something to do. The file descriptors being polled include the TCP connection to the client and any descriptors used to service active channels.

    sshd wants to be able to interrupt the select() call when a SIGCHLD signal is received. To do that, sshd installs a signal handler for SIGCHLD and creates a pipe. When a SIGCHLD signal is received, the signal handler writes a byte into the pipe. The read end of the pipe is included in the list of file descriptors polled by select(), so writing to the pipe causes the select() call to return with an indication that the notify pipe is readable. This is why both ends of the pipe are open in the same sshd process.

    All of the code is in serverloop.c:

    /*
     * we write to this pipe if a SIGCHLD is caught in order to avoid
     * the race between select() and child_terminated
     */
    static int notify_pipe[2];
    static void
    notify_setup(void)
    {
            if (pipe(notify_pipe) < 0) {
                    error("pipe(notify_pipe) failed %s", strerror(errno));
            } else if ((fcntl(notify_pipe[0], F_SETFD, 1) == -1) ||
                (fcntl(notify_pipe[1], F_SETFD, 1) == -1)) {
                    error("fcntl(notify_pipe, F_SETFD) failed %s", strerror(errno));
                    close(notify_pipe[0]);
                    close(notify_pipe[1]);
            } else {
                    set_nonblock(notify_pipe[0]);
                    set_nonblock(notify_pipe[1]);
                    return;
            }
            notify_pipe[0] = -1;    /* read end */
            notify_pipe[1] = -1;    /* write end */
    }
    static void
    notify_parent(void)
    {
            if (notify_pipe[1] != -1)
                    write(notify_pipe[1], "", 1);
    }
    [...]
    
    /*ARGSUSED*/
    static void
    sigchld_handler(int sig)
    {
            int save_errno = errno;
            child_terminated = 1;
    #ifndef _UNICOS
            mysignal(SIGCHLD, sigchld_handler);
    #endif
            notify_parent();
            errno = save_errno;
    }
    

    The code to set up and perform the select call is in another function called wait_until_can_do_something(). It's fairly long so I won't include it here. OpenSSH is open source, and this page describes how to download the source code.