This is on StackOverflow as opposed to SuperUser/ServerFault since it has to do with the syscalls and OS interactions being performed by sshd, not the problem I'm having using SSH (though assistance with that is appreciated, too :p).
Context:
I invoke a complex series of scripts via SSH, e.g. ssh user@host -- /my/command. The remote command does a lot of complex forking and execcing and eventually results in a backgrounded daemon process running on the remote host. Occasionally (I'm slowly going mad trying to find out reliable reproduction conditions), the ssh command will never return control to the client shell. In those situations, I can go onto the target host and see an sshd: user@notty process with no children hanging indefinitely.
Fixing that issue is not what this question is about. This question is about what that sshd process is doing.
The SSH implementation is OpenSSH, and the version is 5.3p1-112.el6_7.
The problem:
If I find one of those stuck sshds and strace it, I can see it's doing a select on two handles, e.g. select(12, [3 6], [], NULL, NULL or similar. lsof tells me that one of those handles is the TCP socket connecting back to the SSH client. The other is a pipe, the other end of which is only open in the same sshd process. If I search for that pipe by ID using the answer to this SuperUser question, the only process that contains references to that pipe is the same process. lsof confirms this: both the read and write ends of the pipe are open in the same process, e.g. (for pipe 788422703 and sshd PID 22744):
sshd 22744 user 6r FIFO 0,8 0t0 788422703 pipe
sshd 22744 user 7w FIFO 0,8 0t0 788422703 pipe
Questions:
What is SSH waiting for? If the pipe isn't connected to anything and there are no child processes, I can't imagine what event it could be expecting.
What is that "looped" pipe/what does it represent? My only theory is that maybe if STDIN isn't supplied to the SSH client, the target host sshd opens a dummy STDIN pipe so some of its internal child-management code can be more uniform? But that seems pretty tenuous.
How does SSH get into this situation?
What I've Tried/Additional Info:
A hung sshd process can be produced by issuing a command that backgrounds itself, e.g. ssh user@host -- 'sleep 60 &'; sshd will wait for the streams to the daemonized process to be closed, not just for the exit of its immediate child. Since the scripts in question eventually result (way down the process tree) in a daemon being started, it initially seemed possible that the daemon was holding onto a handle. However, that doesn't seem to hold up: using the sleep 60 & command as an example, sshd processes communicating with daemons hold and select on four open pipes, not just two, and at least two of the pipes are connected from sshd to the daemon process, not looped. Unless there's a method of tracking/pointing to a pipe I don't know about (and there likely is; for example, I have no idea how dup'ed filehandles play into close() semaphore waiting or piping), I don't think the pipe-to-self situation represents a waiting-on-daemon case.
sshd periodically receives communication on the TCP socket/ssh connection itself, which wakes it up out of the select for a brief period of communication (during which strace shows it blocking SIGCHLD), and then it goes back to waiting on the same FDs.
It seems that you're describing the notify pipe. The OpenSSH sshd main loop calls select() to wait until it has something to do. The file descriptors being polled include the TCP connection to the client and any descriptors used to service active channels.
sshd wants to be able to interrupt the select() call when a SIGCHLD signal is received. To do that, sshd installs a signal handler for SIGCHLD and it creates a pipe. When a SIGCHLD signal is received, the signal handler writes a byte into the pipe. The read end of the pipe is included in the list of file descriptors polled by select(). The act of writing to the pipe would cause the select() call to return with an indication that the notify pipe is readable.
All of the code is in serverloop.c:
/*
 * we write to this pipe if a SIGCHLD is caught in order to avoid
 * the race between select() and child_terminated
 */
static int notify_pipe[2];

static void
notify_setup(void)
{
	if (pipe(notify_pipe) < 0) {
		error("pipe(notify_pipe) failed %s", strerror(errno));
	} else if ((fcntl(notify_pipe[0], F_SETFD, 1) == -1) ||
	    (fcntl(notify_pipe[1], F_SETFD, 1) == -1)) {
		error("fcntl(notify_pipe, F_SETFD) failed %s", strerror(errno));
		close(notify_pipe[0]);
		close(notify_pipe[1]);
	} else {
		set_nonblock(notify_pipe[0]);
		set_nonblock(notify_pipe[1]);
		return;
	}
	notify_pipe[0] = -1;	/* read end */
	notify_pipe[1] = -1;	/* write end */
}

static void
notify_parent(void)
{
	if (notify_pipe[1] != -1)
		write(notify_pipe[1], "", 1);
}

[...]

/*ARGSUSED*/
static void
sigchld_handler(int sig)
{
	int save_errno = errno;
	child_terminated = 1;
#ifndef _UNICOS
	mysignal(SIGCHLD, sigchld_handler);
#endif
	notify_parent();
	errno = save_errno;
}
The code to set up and perform the select call is in another function called wait_until_can_do_something(). It's fairly long so I won't include it here. OpenSSH is open source, and this page describes how to download the source code.