Tags: java, grpc, grpc-java

gRPC client side reconnection logic causing duplicate streams to be opened server side


I have a gRPC client that uses two bidi streams. For reasons currently unknown, when we send a keepAlive ping every hour, onError is called with a StatusRuntimeException on both streams.

To handle the reconnection, I've implemented the following retry mechanism, shown in Java pseudocode. I will clarify anything as necessary in the comments.

The mechanism looks like so:

onError() {
    retrySyncStream();
}


void retrySyncStream() {
    // capture the current StreamObserver
    previousStream = this.streamObserver;

    // open a new stream
    streamObserver = bidiStub.startStream(responseObserver);

    waitForChannelReady(); // <-- simplified version, we use the gRPC notification listener 

    previousStream.onCompleted(); // <-- called on notify of channel READY

}

Although we attempt to close the old stream, on the server side we see two connections open on two HA nodes. I don't control anything on the server side; I just need to handle the reconnection logic on the client.

First, is it common practice to ditch the old StreamObserver after getting a StatusRuntimeException? I am doing this because we test our client against a mock server, a Spring Boot application. When I force a shutdown (Ctrl-C) of the Spring Boot server app and start it back up again, the client can't use the original StreamObserver; it has to create a new one by calling the gRPC bidi stream API again.

From what I've read online, people say not to ditch the managed channel, but what about stream observers, and how do I make sure that multiple streams aren't being opened by mistake? Thanks.


Solution

  • When the StreamObserver gets an error, the RPC is dead. It is appropriate to ditch it.

    When you re-create the stream, consider what would happen if the server is having trouble. Generally you'd have exponential backoff in place somewhere, so the client does not re-open streams in a tight loop. For bidi streaming, several uses within gRPC itself tend to reset the backoff once the client has received a response from the server.

    Since both streams die together, it sounds like the TCP connection was dead. Unfortunately, with TCP you have to send on the connection to learn that it is dead. The client learns the connection is dead because it can't send to the HA proxy over it, which means the HA proxy has to discover the dead connection separately. Server-side keepalive could help with this, although TCP keepalive at the HA proxy is probably also warranted.
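
The backoff advice above could be sketched roughly as follows. This is only an illustration, not part of grpc-java: the class name ReconnectBackoff, its methods, and the 20% jitter are all assumptions made for the example. It hands out a growing delay before each attempt to re-open the stream and starts over once the server has answered.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical helper (not a grpc-java class): tracks how long to wait
 * before re-opening a stream after onError.
 */
final class ReconnectBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private final double multiplier;
    private long nextDelayMs;

    ReconnectBackoff(long initialDelayMs, long maxDelayMs, double multiplier) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.multiplier = multiplier;
        this.nextDelayMs = initialDelayMs;
    }

    /** Returns the delay to use now and grows the following one, capped at maxDelayMs. */
    synchronized long nextDelayMs() {
        long current = nextDelayMs;
        nextDelayMs = Math.min(maxDelayMs, (long) (nextDelayMs * multiplier));
        return current;
    }

    /** Like nextDelayMs(), but scaled by a random factor in [0.8, 1.2) so clients do not retry in lockstep. */
    long nextDelayWithJitterMs() {
        long base = nextDelayMs();
        double factor = 0.8 + ThreadLocalRandom.current().nextDouble() * 0.4;
        return (long) (base * factor);
    }

    /** Intended to be called when a response arrives: the server is answering, so start over. */
    synchronized void reset() {
        nextDelayMs = initialDelayMs;
    }
}
```

With something like this, onError would schedule retrySyncStream() after backoff.nextDelayWithJitterMs() (for example on a ScheduledExecutorService) instead of calling it immediately, and the response observer's onNext would call backoff.reset().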