I use autossh to create a remote tunnel with the following command (IPs and Port Changed):
autossh -M 0 -o "ServerAliveInterval 5" -o "ServerAliveCountMax 3" -f -T -N -i /root/.ssh/id_rsa -R 1602:localhost:443 [email protected]
And the server has this config in sshd:
GatewayPorts yes
ClientAliveInterval 10
ClientAliveCountMax 6
This works like a charm most of the time. Timeouts and disconnects are also handled very well.
But there is one exception:
If there is only a very short interruption of the network connection, the client notices it and starts a reconnect. But the server has not noticed the interruption yet and still holds port 1602. I can then see this message in the server log: sshd[431646]: error: bind [::]:1602: Address already in use.
But autossh does not hang up and try again; it keeps the non-working tunnel open. A few seconds later, the server recognises the disconnect of the old tunnel and frees port 1602.
Now I have an autossh/ssh tunnel that does all the watchdog work (I can see the keep-alive message in the log every 5 seconds) and stays alive. But the port on the server is now unused, so the tunnel does not work: nothing is listening on the port at all.
Autossh does not recover from this state without manual intervention. There are multiple ways to recover manually, but that is not the question.
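One way to confirm this state on the server is to check whether anything is still listening on the forwarded port. The following is only a minimal sketch: it assumes the Linux `ss` tool is available on the server, and `port_is_listening` is a hypothetical helper name, not part of autossh or OpenSSH.

```shell
#!/bin/sh
# Succeeds if the `ss -ltn` output read from stdin shows a listener on port $1.
# Column 4 of `ss -ltn` is "Local Address:Port", e.g. [::]:1602 or 0.0.0.0:1602.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# On the server (run manually):
#   ss -ltn | port_is_listening 1602 && echo "port 1602 bound" || echo "port 1602 free"
```

If the ssh client is still running and sending keep-alives while this reports the port as free, the tunnel is in the broken state described above.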
My questions are:
I'm searching for a way to recover from this state automatically. And I wonder why this race condition is not mentioned anywhere on the internet, even though it can be reproduced easily.
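You need to add -o "ExitOnForwardFailure yes" to the autossh command line. autossh passes the option through to ssh, which then terminates the connection if it cannot set up the requested port forwarding, instead of staying connected without the forward. autossh sees ssh exit and can start a new connection.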
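As a sketch, this is the command from the question with that option added. The wrapper function `start_tunnel` and the `DRY_RUN` variable are hypothetical conveniences for illustration only; the SSH destination is passed as an argument because the real one is not shown in the question.

```shell
#!/bin/sh
# Sketch: the autossh command from the question plus ExitOnForwardFailure.
# Usage: start_tunnel user@host
# Set DRY_RUN=1 to print the command instead of running it.
start_tunnel() {
    # "$1" (the SSH destination) is expanded before `set --` replaces the arguments.
    set -- autossh -M 0 \
        -o "ServerAliveInterval 5" \
        -o "ServerAliveCountMax 3" \
        -o "ExitOnForwardFailure yes" \
        -f -T -N -i /root/.ssh/id_rsa \
        -R 1602:localhost:443 "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `start_tunnel user@host` with your own destination starts the tunnel. If the server still holds port 1602 when ssh reconnects, ssh should now exit rather than stay up without the forward, and autossh can retry until the port is free.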