tensorflow, google-cloud-platform, google-cloud-ml, tensorflow-estimator, gcp-ai-platform-training

MultiWorkerMirroredStrategy() not working on Google AI-Platform (CMLE)


I'm getting the following error when using MultiWorkerMirroredStrategy() to train a custom Estimator on Google AI-Platform (CMLE).

ValueError: Unrecognized task_type: 'master', valid task types are: "chief", "worker", "evaluator" and "ps".

Both MirroredStrategy() and ParameterServerStrategy() work fine on AI-Platform with their respective config.yaml files. I'm currently not providing device scopes for any operations, nor am I providing any device filter in the session config, tf.ConfigProto(device_filters=device_filters).

The config.yaml file which I'm using for training with MultiWorkerMirroredStrategy() is:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4

The masterType input is mandatory for submitting the training job on AI-Platform.

Note: The error lists 'chief' as a valid task type and 'master' as invalid. I'm specifying tensorflow-gpu==1.14.0 in setup.py for the trainer package.


Solution

  • (1) This appears to be a bug in MultiWorkerMirroredStrategy. Please file a bug against TensorFlow. In TensorFlow 1.x it should be using master, and in TensorFlow 2.x it should be using chief. The code is (wrongly) asking for chief, while AI Platform (because you are using 1.14) provides only master. Incidentally: master = chief + evaluator.

    (2) Do not add tensorflow to your setup.py. Instead, specify the TensorFlow version you want AI Platform to use with the --runtime-version flag to gcloud (see https://cloud.google.com/ml-engine/docs/runtime-version-list).
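
Until the mismatch described in (1) is fixed, one possible workaround is to rewrite the TF_CONFIG environment variable at the very start of the trainer, before any strategy or RunConfig is created, so that the role AI Platform labels master is presented as chief. The sketch below is not part of the original answer and has not been verified against 1.14 on AI Platform; the helper names rename_master_to_chief and patch_tf_config_env are hypothetical, and it assumes TF_CONFIG holds the usual JSON with "cluster" and "task" keys.

```python
import json
import os


def rename_master_to_chief(tf_config_json):
    """Return a TF_CONFIG JSON string with the 'master' role renamed to 'chief'.

    Hypothetical helper: assumes the usual layout with a "cluster" mapping
    of role -> addresses and a "task" mapping with "type" and "index".
    """
    if not tf_config_json:
        # Nothing to rewrite (for example, a local single-machine run).
        return tf_config_json

    config = json.loads(tf_config_json)

    # Rename the cluster entry, keeping its address list unchanged.
    cluster = config.get("cluster", {})
    if "master" in cluster:
        cluster["chief"] = cluster.pop("master")

    # Rename this process's own role if it is the master.
    task = config.get("task", {})
    if task.get("type") == "master":
        task["type"] = "chief"

    return json.dumps(config)


def patch_tf_config_env():
    """Apply the rename to os.environ in place (hypothetical helper)."""
    original = os.environ.get("TF_CONFIG")
    if original:
        os.environ["TF_CONFIG"] = rename_master_to_chief(original)
```

Calling patch_tf_config_env() as the first statement of the trainer's entry point would apply the rename on every node. Because master = chief + evaluator, this sketch only relabels the role and does nothing to add a separate evaluator.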