
Troubleshoot RServe config option keep.alive


I am using RServe 1.7.3 on a headless RHEL 7.9 VM. On the client, I am using RserveCLI2.

On long-running jobs, the TCP/IP connection is blocked by a firewall after 2 hours.

I came across the keep.alive configuration option, which has been available since RServe 1.7.2 (RServe News/Changelog).

The specs read:

added support for keep.alive configuration option - it is global to all servers and if enabled the client sockets are instructed to keep the connection alive by periodic messages.

I added the following to /etc/Rserv.conf:

keep.alive enable

but this does not prevent the connection from being blocked.

Unfortunately, I cannot run a network monitoring tool, like Wireshark, to monitor the traffic between client and server.

How could I troubleshoot this?

Some specific questions I have:

  1. Is the path of the config file indeed /etc/Rserv.conf, as specified in Documentation for Rserve? Notice that it does not have a final e, like Rserve.
  2. Does this behaviour depend on the RServe client in use, or is this completely handled at the socket level?
  3. Can I inspect the runtime settings of RServe, to see if keep.alive is enabled?

Solution

  • We got this to work.

    To summarize, we adjusted some kernel settings so that keep-alive packets are sent at shorter intervals, which prevents network components from deeming the connection dead.

    This is how and why.

    The keep.alive enable setting is in fact an instruction to the socket layer to periodically emit keep-alive packets from server to client. The client is expected to return an ACK on these packets. The behaviour is governed by three kernel-level settings, as explained in TCP Keepalive HOWTO - Using TCP keepalive under Linux:

    1. tcp_keepalive_time (defaults to 7200 seconds)
    2. tcp_keepalive_intvl (defaults to 75 seconds)
    3. tcp_keepalive_probes (defaults to 9 times)

    The tcp_keepalive_time is how long the connection must be idle, with no data exchanged, before the first keep-alive packet is sent. The tcp_keepalive_intvl is the wait time between subsequent packets, and tcp_keepalive_probes is the number of consecutive unacknowledged packets after which the system decides the connection is dead.
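
    To see which values are currently in effect on a Linux host, the three settings can be read from /proc/sys/net/ipv4. The following is a minimal sketch, not part of our original setup; the function name read_keepalive_settings and its base parameter are made up for illustration.

```python
from pathlib import Path

# Standard location of the IPv4 TCP settings on Linux.
PROC_IPV4 = Path("/proc/sys/net/ipv4")
KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_settings(base: Path = PROC_IPV4) -> dict:
    """Return the three keep-alive settings as integers, keyed by name.

    `base` is a hypothetical parameter that only exists so the function
    can be pointed at a different directory, for example in a test.
    """
    return {key: int((base / key).read_text().strip()) for key in KEEPALIVE_KEYS}


# Example use on a Linux host:
#     for name, value in read_keepalive_settings().items():
#         print(f"{name} = {value}")
```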

    So, the first keep-alive packet was only sent after 2 hours. By that time, some network component had already decided the connection was dead, so the keep-alive packet never made it to the client and no ACK was ever sent.
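
    The timing argument can be checked with a little arithmetic. The sketch below is illustrative only: the 7200-second idle timeout for the network component is an assumption (we only know the connection was gone after about 2 hours), and the class and function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepaliveSettings:
    time: int    # tcp_keepalive_time: idle seconds before the first probe
    intvl: int   # tcp_keepalive_intvl: seconds between probes
    probes: int  # tcp_keepalive_probes: unacknowledged probes before giving up


def dead_peer_detected_after(s: KeepaliveSettings) -> int:
    """Approximate idle seconds until the kernel declares the peer dead."""
    return s.time + s.intvl * s.probes


def first_probe_beats_timeout(s: KeepaliveSettings, idle_timeout: int) -> bool:
    """True if the first probe is sent before an idle timeout would expire."""
    return s.time < idle_timeout


DEFAULTS = KeepaliveSettings(time=7200, intvl=75, probes=9)
TUNED = KeepaliveSettings(time=600, intvl=600, probes=9)

# Assumed idle timeout of the network component, in seconds (hypothetical).
ASSUMED_IDLE_TIMEOUT = 7200

print(first_probe_beats_timeout(DEFAULTS, ASSUMED_IDLE_TIMEOUT))  # False
print(first_probe_beats_timeout(TUNED, ASSUMED_IDLE_TIMEOUT))     # True
print(dead_peer_detected_after(DEFAULTS))                         # 7875
```

    With the defaults, the first probe goes out no earlier than the assumed timeout, which matches what we observed.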

    We lowered both tcp_keepalive_time and tcp_keepalive_intvl to 600 seconds.
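
    One way to apply such values at runtime is to write them to the corresponding files under /proc/sys/net/ipv4, which requires root. This is a hedged sketch rather than a record of the exact commands we ran; set_keepalive and its base parameter are hypothetical names. Values written this way do not survive a reboot, so a permanent change would also need an entry in the sysctl configuration (for example /etc/sysctl.conf).

```python
from pathlib import Path

# Standard location of the IPv4 TCP settings on Linux.
PROC_IPV4 = Path("/proc/sys/net/ipv4")


def set_keepalive(time_s: int, intvl_s: int, base: Path = PROC_IPV4) -> None:
    """Write a new keep-alive time and interval, both in seconds.

    On a real system this needs root. `base` is a hypothetical parameter
    that lets the function be exercised against a scratch directory.
    """
    if time_s <= 0 or intvl_s <= 0:
        raise ValueError("keep-alive values must be positive")
    (base / "tcp_keepalive_time").write_text(f"{time_s}\n")
    (base / "tcp_keepalive_intvl").write_text(f"{intvl_s}\n")


# Example use (as root), matching the values mentioned above:
#     set_keepalive(600, 600)
```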

    With tcpdump -i [interface] port 6311 we were able to monitor the keep-alive packets.

    15:40:11.225941 IP <server>.6311 > <some node>.<port>: Flags [.], ack 1576, win 237, length 0
    15:40:11.226196 IP <some node>.<port> > <server>.6311: Flags [.], ack 401, win 511, length 0

    This continues until the results are sent back and the connection is closed. At least, it did so in my test, which ran for 12 hours.

    So, we use keep-alive here not to check for dead peers, but to prevent disconnection due to network inactivity, as is discussed in TCP Keepalive HOWTO - 2.2. Why use TCP keepalive?. In that scenario, you want to use low values for keep-alive time and interval.

    Note that these are kernel-level settings and thus apply system-wide. We use a dedicated server, so this is no issue for us, but it may be in other cases.
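
    Where system-wide changes are not acceptable and you control the program that opens the socket, Linux also allows the three values to be overridden per socket. The sketch below shows the general technique in Python; it is not something Rserve or RserveCLI2 is known to expose, and enable_keepalive is a hypothetical helper name.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 600,
                     intvl_s: int = 600, probes: int = 9) -> None:
    """Turn on keep-alive for one socket and override the kernel defaults for it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names.
    # Any option the platform does not provide is skipped.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),    # per-socket tcp_keepalive_time
        ("TCP_KEEPINTVL", intvl_s),  # per-socket tcp_keepalive_intvl
        ("TCP_KEEPCNT", probes),     # per-socket tcp_keepalive_probes
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Example use:
#     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
#     enable_keepalive(s)
```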

    Finally, for completeness, I'll answer my own three questions.

    1. The path of the configuration file is /etc/Rserv.conf, as was confirmed by changing another setting (remote enable to remote disable).
    2. This is handled at the socket level.
    3. I am not sure, but using tcpdump shows that Rserve emits keep-alive packets, which is a more useful way to inspect what's happening.