Tags: python, tensorflow, parallel-processing, multiprocessing, distributed-computing

Why should one use tf.train.Server to execute multiple tf.Session() in parallel?


The official way to execute multiple tf.Session() in parallel is to use tf.train.Server, as described in Distributed TensorFlow. On the other hand, the following works for Keras and, according to Keras + Tensorflow and Multiprocessing in Python, can presumably be adapted to plain TensorFlow without using tf.train.Server.

import multiprocessing

def _training_worker(train_params):
    # import keras inside the child so the parent process never initializes the GPU
    import keras
    model = obtain_model(train_params)
    model.fit(train_params)
    send_message_to_main_process(...)

def train_new_model(train_params):
    # args must be a tuple, even for a single argument
    training_process = multiprocessing.Process(target=_training_worker,
                                               args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)
    training_process.join()

Is the first method faster than the second? I have code written in the second way, and due to the nature of my algorithm (AlphaZero), a single GPU is supposed to serve many processes, each of which performs prediction on a tiny minibatch.
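(Side note on that setup: by default each TensorFlow 1.x process tries to claim the whole GPU, so many child processes can only share one card if each of them caps its own memory use. A minimal sketch of such a cap for the Keras case above, assuming the TensorFlow backend of Keras 2.x; the helper name and the memory fraction are illustrative:)

def _limit_gpu_memory(fraction=0.1):
    # call this inside the child process, right after the in-process
    # `import keras` and before building the model
    import tensorflow as tf
    from keras import backend as K
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction,
                                allow_growth=True)
    K.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))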


Solution

  • tf.train.Server is designed for distributed computation within a cluster, where different nodes need to communicate with each other. This is especially useful when training is distributed across multiple machines or, in some cases, across multiple GPUs on a single machine. From the documentation:

    An in-process TensorFlow server, for use in distributed training.

    A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.

    Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes don't interact with each other. This method is easier to set up, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.

    tf.train.Server is thus a more universal and scalable solution. Besides, it lets you organize complex training with non-trivial communication patterns, e.g. asynchronous gradient updates. Whether it trains faster greatly depends on the task; I don't think there will be a significant difference on one shared GPU.

    Just for reference, here's what the code looks like with the server (a between-graph replication example):

    import sys
    import tensorflow as tf

    # specify the cluster's architecture
    cluster = tf.train.ClusterSpec({
      'ps': ['192.168.1.1:1111'],
      'worker': ['192.168.1.2:1111',
                 '192.168.1.3:1111']
    })

    # parse the command line to specify this machine's role
    job_type = sys.argv[1]       # job type: "worker" or "ps"
    task_idx = int(sys.argv[2])  # index of this task in the worker or ps list of the ClusterSpec

    # create the TensorFlow Server. This is how the machines communicate.
    server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

    # the parameter server is updated by remote clients;
    # it will not proceed beyond this if statement.
    if job_type == 'ps':
      server.join()
    else:
      # workers only
      with tf.device(tf.train.replica_device_setter(worker_device='/job:worker/task:%d' % task_idx,
                                                    cluster=cluster)):
        # build your model here as if you were using a single machine
        pass

      with tf.Session(server.target):
        # train your model here
        pass
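
    To run this, the same script is launched once per cluster entry with the matching job type and task index (the script name here is hypothetical): python train.py ps 0 on the parameter server, and python train.py worker 0 / python train.py worker 1 on the workers. For the single-machine case in your question, a server can also be created without writing a ClusterSpec at all via tf.train.Server.create_local_server(), which starts a single-process cluster; a minimal sketch, assuming TF 1.x:

    import tensorflow as tf

    # single-process "cluster": the server owns the TensorFlow runtime,
    # and sessions attach to it through server.target
    server = tf.train.Server.create_local_server()

    with tf.Session(server.target) as sess:
      # build and train/predict with your model here as usual
      pass

    Sessions created against server.target already go through the distributed runtime, which should make it easier to point the same code at a real multi-machine cluster later.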