Tags: python, asynchronous, tensorflow-serving, sanic

Why would this TensorFlow Serving gRPC call hang?


We have a fairly complicated system that stitches together different data sources to make product recommendations for our users. Among the components is often a call out to one or more TensorFlow Serving models that we have running. This has been fine, even under load, until recently some of our final REST APIs (using the Sanic framework) now sometimes take over 10 seconds to return.

Using cProfile, the problem appears to be the gRPC call hanging, but it seems isolated to something in our final web serving layer: when I run the TensorFlow Serving code below on its own, it breezes through a series of random inputs without any issues.

Here's the code we're running, with some specific details removed:

import grpc
import numpy as np
from tensorflow.python.framework import tensor_util as util  # make_tensor_proto
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# MESSAGE_OPTIONS (e.g. max message sizes) is defined elsewhere in our code.

def get_tf_serving(model_uri, model_name, input_name, output_name, X):
    channel = grpc.insecure_channel(model_uri, options=MESSAGE_OPTIONS)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build a request against the model's default serving signature
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = 'serving_default'
    request.inputs[input_name].CopyFrom(
        util.make_tensor_proto(X.astype(np.float32), shape=X.shape))

    # Blocking call with a 4-second timeout
    result = stub.Predict(request, 4.0)
    channel.close()

    # imagine more stuff here doing something with the returned data
    data = result.outputs[output_name].float_val

    return data
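
For reference, the standalone test mentioned above is essentially just a loop over random inputs, along these lines (the address, model name, tensor names, and input shape here are placeholders, not our real configuration):

# Hypothetical smoke test: call the model repeatedly with random inputs.
if __name__ == '__main__':
    for _ in range(100):
        X = np.random.rand(32, 20)
        data = get_tf_serving('tf-serving.internal:8500', 'recommender',
                              'input_1', 'predictions', X)
        print(len(data))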

This is called by another function, which is ultimately called by a route that will look something like this:

@doc.include(True)
async def get_suggestions(request):
    user_id = request.args.get('user_id', 'null')
    count = int(request.args.get('count', 10))

    data = ...  # something that itself calls `get_tf_serving`

    return data

Is there something basic I'm missing here? Why would these requests suddenly take so long and hang when there's no apparent load issue with the TensorFlow Serving service?

Just to double-check, we quickly re-implemented one of these routes in FastAPI, and while it was maybe a little better, the timeouts kept happening.

Update: as another test, we reimplemented everything using the TensorFlow Serving REST HTTP API instead, and lo and behold, the problem completely disappeared. I feel like gRPC should be the better option, though, and I still can't figure out why it was hanging.
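
For completeness, the REST version is roughly the following (the host and model name are placeholders; TensorFlow Serving exposes its HTTP API on port 8501 by default):

import numpy as np
import requests

def get_tf_serving_rest(host, model_name, X):
    # TensorFlow Serving REST predict endpoint
    url = f'http://{host}:8501/v1/models/{model_name}:predict'
    payload = {
        'signature_name': 'serving_default',
        'instances': X.astype(np.float32).tolist(),
    }
    response = requests.post(url, json=payload, timeout=4.0)
    response.raise_for_status()
    return response.json()['predictions']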


Solution

  • The issue here was not the TensorFlow Serving setup or the Python code, but the way the networking between the two was configured. The TensorFlow Serving instances were orchestrated by Kubernetes and exposed through a standard Kubernetes Service; it was that Service the Python code called, and its configuration that was causing the timeouts.

    This post on the Kubernetes blog explains the details. In a nutshell, gRPC runs on HTTP/2 and multiplexes many requests over a single long-lived connection, so the connection-level load balancing that a standard Kubernetes Service provides ends up pinning all of that traffic to one pod: the multiplexing that is normally one of gRPC's advantages works against it here.

    The solution, also described in the same blog post, is to put a more sophisticated load-balancing layer in front of the TensorFlow Serving instances, for example a headless Service with client-side gRPC balancing or an L7 proxy / service mesh, rather than relying on the default Service.
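
    If you want to stay on gRPC without a service mesh, one common workaround is to let the gRPC client itself balance across the pod IPs it resolves via DNS. This is only a sketch under the assumption that TensorFlow Serving sits behind a headless Service; the service name below is made up:

        # Resolve all pod IPs through the headless Service and round-robin
        # across them, so requests are balanced per call rather than being
        # pinned to a single connection.
        channel = grpc.insecure_channel(
            'dns:///tf-serving-headless.default.svc.cluster.local:8500',
            options=[('grpc.lb_policy_name', 'round_robin'), *MESSAGE_OPTIONS],
        )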