tensorflow-serving, google-cloud-ml, tensorflow, google-cloud-ml-engine

Running distributed TensorFlow on Google Cloud ML Engine: ClusterSpec


I am trying to run a large distributed TensorFlow model on Google Cloud ML Engine and am having trouble understanding what should go in tf.train.ClusterSpec.

When you run a job on Google Cloud you can select the scale tier from BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU or CUSTOM, each giving you access to a different type of cluster. However, I can't find the names or addresses of the machines in these clusters.


Solution

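  • Please take a look at the documentation and sample here. Cloud ML Engine sets the TF_CONFIG environment variable on every machine in the job, so you don't need to know the machine names in advance: build the ClusterSpec from the cluster entry of TF_CONFIG, e.g.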

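      import json
      import os

      import tensorflow as tf

      # Enclosing function implied by the `return` below; the name is assumed.
      def dispatch(*args, **kwargs):
        tf_config = os.environ.get('TF_CONFIG')

        # If TF_CONFIG is not available, run locally.
        # run() is the training entry point defined elsewhere in the sample.
        if not tf_config:
          return run('', True, *args, **kwargs)

        tf_config_json = json.loads(tf_config)
        # 'cluster' maps each job name to its list of host:port addresses.
        cluster = tf_config_json.get('cluster')
        ...
        cluster_spec = tf.train.ClusterSpec(cluster)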
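
To make the shape of TF_CONFIG concrete, here is a minimal sketch that parses it without importing TensorFlow. The helper name parse_tf_config and the host names in example_env are made up for illustration; on a real job the service assigns the addresses, and the variable may carry additional fields beyond the cluster and task entries shown here.

```python
import json
import os


def parse_tf_config(env=None):
    """Return (cluster, task_type, task_index), or (None, None, None) if TF_CONFIG is unset."""
    env = os.environ if env is None else env
    raw = env.get('TF_CONFIG')
    if not raw:
        # No TF_CONFIG: treat this as a local, non-distributed run.
        return None, None, None
    config = json.loads(raw)
    cluster = config.get('cluster')   # job name -> list of host:port strings
    task = config.get('task', {})     # identifies the role of this machine
    return cluster, task.get('type'), task.get('index')


# Hypothetical TF_CONFIG value; the host names are placeholders, not real addresses.
example_env = {
    'TF_CONFIG': json.dumps({
        'cluster': {
            'master': ['master-0.example:2222'],
            'worker': ['worker-0.example:2222', 'worker-1.example:2222'],
            'ps': ['ps-0.example:2222'],
        },
        'task': {'type': 'worker', 'index': 1},
    })
}

cluster, job_name, task_index = parse_tf_config(example_env)
print(job_name, task_index, sorted(cluster))
```

The cluster dictionary is what gets passed to tf.train.ClusterSpec. In TensorFlow 1.x the task type and index would typically be passed as job_name and task_index to tf.train.Server, so each machine knows which entry of the cluster it is.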