I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose the server on the client side. Issue 4 is actually about adding a load balancer in front of that setup, which is currently absent from TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in the documentation of either project.
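To make that concrete, here is a minimal sketch of such pinning in plain TensorFlow 1.x; the host names and cluster layout are made up, and each machine would additionally need to run a tf.train.Server with the same cluster spec:

    import tensorflow as tf

    # Hypothetical two-worker cluster; the host names are placeholders.
    # In a real setup, each machine also runs
    # tf.train.Server(cluster, job_name="worker", task_index=i).
    cluster = tf.train.ClusterSpec({
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]
    })

    # Manually "pin" parts of the graph to particular machines.
    with tf.device("/job:worker/task:0"):
        a = tf.constant(3.0)
    with tf.device("/job:worker/task:1"):
        b = tf.constant(4.0)
    result = a + b

    # The session connects to one node of the cluster, which acts as the master.
    with tf.Session("grpc://worker0.example.com:2222") as sess:
        print(sess.run(result))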
The question is: is it possible to configure TensorFlow Serving to use a distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local ones) with some hacks:

1. I made the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it is internal to the tensorflow package by default) by modifying the BUILD file.
2. I added it as a dependency of the tensorflow_serving/model_servers:tensorflow_model_server target.
3. I added an extra flag to tensorflow_model_server called --session_target, which sets up session_bundle_config.session_target() in main.cc.
4. I ran the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
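For reference, what that flag boils down to, in plain (non-Serving) TensorFlow terms, is roughly the difference between a local and a gRPC-backed master session; this is only an analogy, not the actual model server code:

    import tensorflow as tf

    # A single-node cluster purely for illustration; in practice this is
    # the already-running distributed cluster the model server should join.
    cluster = tf.train.ClusterSpec({"local": ["localhost:12345"]})
    server = tf.train.Server(cluster, job_name="local", task_index=0)

    # Default behaviour: an in-process, local session.
    local_sess = tf.Session()

    # What session_bundle_config.session_target() selects instead when
    # --session_target=grpc://localhost:12345 is passed: a master session
    # created against a node of the cluster.
    remote_sess = tf.Session("grpc://localhost:12345")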
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:

1. The grpc_session target is probably internal for a reason.
2. Distributed TensorFlow expects computations to be explicitly "pinned" to particular machines, but the Exporter/Saver workflow used for serving does not seem to deal with device assignments at all.
3. tensorflow_model_server creates its session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.

All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
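A small sketch of that check, plus stripping the assignments on import (the checkpoint path is a placeholder):

    import tensorflow as tf

    # Print every node that carries an explicit device assignment.
    graph_def = tf.get_default_graph().as_graph_def()
    for node in graph_def.node:
        if node.device:
            print(node.name, "->", node.device)

    # When re-importing a graph, clear_devices=True strips those
    # assignments; "model.ckpt.meta" is a placeholder path.
    tf.reset_default_graph()
    saver = tf.train.import_meta_graph("model.ckpt.meta", clear_devices=True)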
If any worker restarts, or there is some temporary loss of network connectivity, your sess.run call fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with serving you need to handle it yourself.
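A minimal sketch of handling it yourself on the serving side, assuming a hypothetical build_graph() helper and the grpc://localhost:12345 target from the question:

    import tensorflow as tf

    TARGET = "grpc://localhost:12345"  # master node, as in the question

    def make_session():
        # Rebuild the graph and open a fresh master session.
        tf.reset_default_graph()
        fetches = build_graph()  # hypothetical: recreates the serving graph
        return tf.Session(TARGET), fetches

    sess, fetches = make_session()

    def handle_request(feed_dict):
        global sess, fetches
        try:
            return sess.run(fetches, feed_dict=feed_dict)
        except tf.errors.UnavailableError:
            # The master went away (restart or network blip): recreate the
            # session and retry once instead of keeping the stale one.
            try:
                sess.close()
            except tf.errors.OpError:
                pass
            sess, fetches = make_session()
            return sess.run(fetches, feed_dict=feed_dict)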