Search code examples
azurempicluster-computingazure-machine-learning-service

Submitting multiple runs to the same node on AzureML


I want to perform hyperparameter search using AzureML. My models are small (around 1GB) thus I would like to run multiple models on the same GPU/node to save costs but I do not know how to achieve this.

The way I currently submit jobs is the following (resulting in one training run per GPU/node):

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)

ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there but if this is done the run fails due to an MPI error that reads as if the cluster is configured to only allow one run per node:

Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407

Using HyperDriveConfig also defaults to submitting one run to one GPU and additionally providing a MpiConfiguration leads to the same error as shown above.

I guess I could always rewrite my train script to train multiple models in parallel, s.t. each run wraps multiple trainings. I would like to avoid this option though, because then logging and checkpoint writes become increasingly messy and it would require a large refactor of the train pipeline. Also this functionality seems so basic that I hope there is a way to do this gracefully. Any ideas?


Solution

  • Use Run.create_children method which will start child runs that are “local” to the parent run, and don’t need authentication.

    For AMLcompute max_concurrent_runs map to maximum number of nodes that will be used to run a hyperparameter tuning run. So there would be 1 execution per node.

    single service deployed but you can load multiple model versions in the init then the score function, depending on the request’s param, uses particular model version to score. or with the new ML Endpoints (Preview). What are endpoints (preview) - Azure Machine Learning | Microsoft Docs