distributed-computing · google-cloud-ml · google-cloud-ml-engine

Run TensorFlow code in distributed mode on Google Cloud ML


Does anybody know what changes need to be made to a trainer in order to run a job on the distributed platform on Google Cloud ML?

It would be of great help if somebody could share a few articles or docs about this.


Solution

  • By and large, your distributed TensorFlow program will be exactly that -- distributed TensorFlow, with minimal -- or even no -- cloud-specific changes. The best resource for distributed TensorFlow is this tutorial on tensorflow.org. The tutorial walks you through the low-level way of doing things.
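
    As a sketch of what that tutorial starts from: a distributed TensorFlow job is described by a cluster definition mapping job names to lists of `host:port` addresses (the hostnames below are placeholders, not real machines):

```python
# The kind of cluster definition that tf.train.ClusterSpec accepts: a dict
# mapping job names ("ps", "worker") to lists of "host:port" addresses.
# Hostnames here are illustrative placeholders.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
    ],
}

# With TensorFlow installed, each task would then build the spec and start
# a server for its own role, e.g.:
#   spec = tf.train.ClusterSpec(cluster)
#   server = tf.train.Server(spec, job_name="worker", task_index=0)
print(sorted(cluster), len(cluster["worker"]))
```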

    There is also a higher-level API, currently in contrib (so API may change and will move out of contrib in a future version), that simplifies the amount of boilerplate code you have to write for distributed training. The official tutorial is here.

    Once you've understood the general TensorFlow bits (whether high-level or low-level APIs), there are some specific elements that must be present in your code for it to run on CloudML Engine. With the low-level TensorFlow APIs, you'll need to parse the TF_CONFIG environment variable to set up your ClusterSpec. This is exemplified in this example (see specifically this block of code).
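
    A minimal sketch of that parsing step, assuming the TF_CONFIG layout CloudML Engine sets on each VM (the JSON below is an illustrative example value, not output from a real job):

```python
import json
import os

# CloudML Engine sets TF_CONFIG on every VM in the cluster. For local
# testing we seed an example value; hostnames are placeholders.
example = {
    "cluster": {
        "master": ["master0:2222"],
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    },
    "task": {"type": "worker", "index": 1},
}
os.environ.setdefault("TF_CONFIG", json.dumps(example))

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_def = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_name = task.get("type", "")    # "master", "worker", or "ps"
task_index = task.get("index", 0)  # this VM's index within its job

# With TensorFlow available, you would then build the ClusterSpec and a
# server for this task:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name=job_name,
#                            task_index=task_index)
print(job_name, task_index)
```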

    One advantage of the higher-level APIs is that all of that parsing is already taken care of for you, so your code should generally just work. See this example. The important piece is that you will need to use learn_runner.run() (see this line), which will train your model both locally and in the cloud.

    Of course, there are other frameworks as well, e.g., TensorFX.

    After you've structured your code appropriately, you simply select an appropriate scale tier that has multiple machines when launching your training job. (See Chuck Finley's answer for an example.)
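
    For illustration, a launch command might look like the following (job name, package path, and bucket are placeholders; STANDARD_1 is one of the predefined multi-machine scale tiers):

```shell
# Submit a training job on a multi-machine scale tier. All names and
# paths below are placeholders for your own project layout.
gcloud ml-engine jobs submit training my_distributed_job \
  --module-name trainer.task \
  --package-path trainer/ \
  --staging-bucket gs://my-bucket \
  --region us-central1 \
  --scale-tier STANDARD_1
```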

    Hope it helps!