I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose the server on the client side. Issue 4 is actually about adding a load balancer in front of that setup, which is currently absent from TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in the documentation of either project.
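To make that concrete, here is a minimal sketch of such pinning in plain TensorFlow 1.x; the host names and cluster layout are made up, and each machine would additionally need to run a tf.train.Server with the same cluster spec:

    import tensorflow as tf

    # Hypothetical two-worker cluster; the host names are placeholders.
    # In a real setup, each machine also runs
    # tf.train.Server(cluster, job_name="worker", task_index=i).
    cluster = tf.train.ClusterSpec({
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]
    })

    # Manually "pin" parts of the graph to particular machines.
    with tf.device("/job:worker/task:0"):
        a = tf.constant(3.0)
    with tf.device("/job:worker/task:1"):
        b = tf.constant(4.0)
    result = a + b

    # The session connects to one node of the cluster, which acts as the master.
    with tf.Session("grpc://worker0.example.com:2222") as sess:
        print(sess.run(result))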
The question is: is it possible to configure TensorFlow Serving to use a distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local ones) with some hacks:

1. I made the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it is internal to the tensorflow package by default) by modifying the BUILD file.
2. I added it as a dependency of the tensorflow_serving/model_servers:tensorflow_model_server target.
3. I added an extra flag to tensorflow_model_server called --session_target, which sets up session_bundle_config.session_target() in main.cc.
4. I ran the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
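For reference, what that flag boils down to, in plain (non-Serving) TensorFlow terms, is roughly the difference between a local and a gRPC-backed master session; this is only an analogy, not the actual model server code:

    import tensorflow as tf

    # A single-node cluster purely for illustration; in practice this is
    # the already-running distributed cluster the model server should join.
    cluster = tf.train.ClusterSpec({"local": ["localhost:12345"]})
    server = tf.train.Server(cluster, job_name="local", task_index=0)

    # Default behaviour: an in-process, local session.
    local_sess = tf.Session()

    # What session_bundle_config.session_target() selects instead when
    # --session_target=grpc://localhost:12345 is passed: a master session
    # created against a node of the cluster.
    remote_sess = tf.Session("grpc://localhost:12345")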
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:

1. The grpc_session target is probably internal for a reason.
2. Distributed TensorFlow expects computations to be explicitly "pinned" to particular machines, but the Exporter/Saver workflow used for serving does not seem to deal with device assignments at all.
3. tensorflow_model_server creates its session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.

All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
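A small sketch of that check, plus stripping the assignments on import (the checkpoint path is a placeholder):

    import tensorflow as tf

    # Print every node that carries an explicit device assignment.
    graph_def = tf.get_default_graph().as_graph_def()
    for node in graph_def.node:
        if node.device:
            print(node.name, "->", node.device)

    # When re-importing a graph, clear_devices=True strips those
    # assignments; "model.ckpt.meta" is a placeholder path.
    tf.reset_default_graph()
    saver = tf.train.import_meta_graph("model.ckpt.meta", clear_devices=True)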
If any worker restarts, or there is some temporary loss of network connectivity, your sess.run call fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with serving you need to handle it yourself.
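A minimal sketch of handling it yourself on the serving side, assuming a hypothetical build_graph() helper and the grpc://localhost:12345 target from the question:

    import tensorflow as tf

    TARGET = "grpc://localhost:12345"  # master node, as in the question

    def make_session():
        # Rebuild the graph and open a fresh master session.
        tf.reset_default_graph()
        fetches = build_graph()  # hypothetical: recreates the serving graph
        return tf.Session(TARGET), fetches

    sess, fetches = make_session()

    def handle_request(feed_dict):
        global sess, fetches
        try:
            return sess.run(fetches, feed_dict=feed_dict)
        except tf.errors.UnavailableError:
            # The master went away (restart or network blip): recreate the
            # session and retry once instead of keeping the stale one.
            try:
                sess.close()
            except tf.errors.OpError:
                pass
            sess, fetches = make_session()
            return sess.run(fetches, feed_dict=feed_dict)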