Tags: python, tensorflow, parallel-processing, multiprocessing, distributed-computing

Why should one use tf.train.Server to execute multiple tf.Session() in parallel?


The official way to execute multiple tf.Session() in parallel is to use tf.train.Server, as described in Distributed TensorFlow. On the other hand, the following works for Keras and, according to Keras + Tensorflow and Multiprocessing in Python, can presumably be adapted to plain TensorFlow without using tf.train.Server.

import multiprocessing

def _training_worker(train_params):
    # import keras inside the child so the parent process never initializes the GPU
    import keras
    model = obtain_model(train_params)
    model.fit(train_params)
    send_message_to_main_process(...)

def train_new_model(train_params):
    # args must be a tuple, even for a single argument
    training_process = multiprocessing.Process(target=_training_worker,
                                               args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)
    training_process.join()

Is the first method faster than the second? I have code written in the second way, and due to the nature of my algorithm (AlphaZero), a single GPU is supposed to serve many processes, each of which performs prediction on a tiny minibatch.
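(Side note on that setup: by default each TensorFlow 1.x process tries to claim the whole GPU, so many child processes can only share one card if each of them caps its own memory use. A minimal sketch of such a cap for the Keras case above, assuming the TensorFlow backend of Keras 2.x; the helper name and the memory fraction are illustrative:)

def _limit_gpu_memory(fraction=0.1):
    # call this inside the child process, right after the in-process
    # `import keras` and before building the model
    import tensorflow as tf
    from keras import backend as K
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction,
                                allow_growth=True)
    K.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))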


Solution

  • tf.train.Server is designed for distributed computation within a cluster, where different nodes need to communicate with each other. This is especially useful when training is distributed across multiple machines or, in some cases, across multiple GPUs on a single machine. From the documentation:

    An in-process TensorFlow server, for use in distributed training.

    A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.

    Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes don't interact with each other. This method is easier to set up, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.

    tf.train.Server is thus a more universal and scalable solution. Besides, it lets you organize complex training with non-trivial communication patterns, e.g. asynchronous gradient updates. Whether it trains faster greatly depends on the task; I don't think there will be a significant difference on one shared GPU.

    Just for reference, here's what the code looks like with the server (a between-graph replication example):

    import sys
    import tensorflow as tf

    # specify the cluster's architecture
    cluster = tf.train.ClusterSpec({
      'ps': ['192.168.1.1:1111'],
      'worker': ['192.168.1.2:1111',
                 '192.168.1.3:1111']
    })

    # parse the command line to specify this machine's role
    job_type = sys.argv[1]       # job type: "worker" or "ps"
    task_idx = int(sys.argv[2])  # index of this task in the worker or ps list of the ClusterSpec

    # create the TensorFlow Server. This is how the machines communicate.
    server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

    # the parameter server is updated by remote clients;
    # it will not proceed beyond this if statement.
    if job_type == 'ps':
      server.join()
    else:
      # workers only
      with tf.device(tf.train.replica_device_setter(worker_device='/job:worker/task:%d' % task_idx,
                                                    cluster=cluster)):
        # build your model here as if you were using a single machine
        pass

      with tf.Session(server.target):
        # train your model here
        pass
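
    To run this, the same script is launched once per cluster entry with the matching job type and task index (the script name here is hypothetical): python train.py ps 0 on the parameter server, and python train.py worker 0 / python train.py worker 1 on the workers. For the single-machine case in your question, a server can also be created without writing a ClusterSpec at all via tf.train.Server.create_local_server(), which starts a single-process cluster; a minimal sketch, assuming TF 1.x:

    import tensorflow as tf

    # single-process "cluster": the server owns the TensorFlow runtime,
    # and sessions attach to it through server.target
    server = tf.train.Server.create_local_server()

    with tf.Session(server.target) as sess:
      # build and train/predict with your model here as usual
      pass

    Sessions created against server.target already go through the distributed runtime, which should make it easier to point the same code at a real multi-machine cluster later.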