
Azure ML: How to train a model on multiple instances


I have an AML compute cluster with both the minimum and maximum node count set to 2. When I run a pipeline, I expect the cluster to train on both nodes in parallel, but the cluster status reports that only one node is busy and the other is idle.

Here's my code to submit the pipeline. As you can see, I'm resolving the cluster by name and passing it to step1, which trains a Keras model.

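import os
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# ws, source_directory and run_config are defined earlier (not shown)
aml_compute = AmlCompute(ws, "cluster-name")  # look up the existing cluster by name
step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)
pipeline1 = Pipeline(workspace=ws, steps=[step1])  # assumed; this line was not in the original snippet
pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)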

Solution

  • Each PythonScriptStep runs on a single node, even if the cluster has several nodes allocated, which is why the second node stays idle. I'm not sure whether training across multiple instances is possible off the shelf in AML, but you can definitely use that single node more effectively, for example by making sure the training uses all of its cores.
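
A minimal sketch of the "use all your cores" idea, assuming Train.py runs Keras on the TensorFlow backend. The helper names (`available_cores`, `plan_threads`, `apply_to_tensorflow`) and the default of 2 inter-op threads are made up for illustration and are not part of the AML SDK. TensorFlow normally picks its own thread counts, so this is mainly a way to check what the node actually exposes and to override the defaults if the node looks underused.

```python
import os


def available_cores():
    """Return the number of CPU cores this process may use (at least 1)."""
    # sched_getaffinity respects CPU pinning but only exists on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    return os.cpu_count() or 1


def plan_threads(total_cores=None, inter_op=2):
    """Split the cores into intra-op and inter-op thread counts.

    The split is an illustrative heuristic, not a tuned recommendation.
    """
    cores = total_cores or available_cores()
    inter = max(1, min(inter_op, cores))
    intra = max(1, cores // inter)
    return {"intra_op": intra, "inter_op": inter}


def apply_to_tensorflow(plan):
    """Apply the plan if TensorFlow is installed; returns False otherwise.

    Must be called before TensorFlow runs its first op, or TensorFlow raises.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False
    tf.config.threading.set_intra_op_parallelism_threads(plan["intra_op"])
    tf.config.threading.set_inter_op_parallelism_threads(plan["inter_op"])
    return True


if __name__ == "__main__":
    plan = plan_threads()
    print("cores:", available_cores(), "plan:", plan,
          "applied:", apply_to_tensorflow(plan))
```

Calling `apply_to_tensorflow(plan_threads())` at the very top of Train.py, before any model is built, would be the place to try this. Printing `available_cores()` in the run log is also a quick way to confirm that the VM size you picked for the cluster is what the step actually sees.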