python, azure, azure-machine-learning-service

Restrict the number of nodes used by an Azure Machine Learning pipeline


I have written a pipeline that I want to run on a remote compute cluster within Azure Machine Learning. My aim is to process a large amount of historical data, and to do this I will need to run the pipeline on a large number of input parameter combinations.

Is there a way to restrict the number of nodes that the pipeline uses on the cluster? By default it will use all the nodes available to the cluster, and I would like to restrict it so that it only uses a pre-defined maximum. This allows me to leave the rest of the cluster free for other users.

My current code to start the pipeline looks like this:

import pandas as pd

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Setup the pipeline (ws is an existing Workspace, data_import_step is defined elsewhere)
steps = [data_import_step] # Contains PythonScriptStep
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

# Big long list of historical dates that I want to process data for
dts = pd.date_range('2019-01-01', '2020-01-01', freq='6H', closed='left')
# Submit the pipeline job
for dt in dts:
    pipeline_run = Experiment(ws, 'my-pipeline-run').submit(
        pipeline,
        pipeline_parameters={
            'import_datetime': dt.strftime('%Y-%m-%dT%H:00'),
        }
    )

Solution

  • For me, the killer feature of Azure ML is not having to worry about load balancing like this. Our team has a compute target with max_nodes=100 for every feature branch, and we have Hyperdrive pipelines that produce 130 runs each.

    We can submit multiple PipelineRuns back-to-back, and the orchestrator does the heavy lifting of queuing and submitting all the runs, so that the PipelineRuns execute in the order they were submitted and the cluster is never overloaded. This works without issue for us 99% of the time.

    If what you're looking for is for the PipelineRuns to be executed in parallel, then you should check out ParallelRunStep; there's a minimal sketch of one at the end of this answer.

    Another option is to isolate your computes. You can have up to 200 ComputeTargets per workspace, and two 50-node ComputeTargets cost the same as one 100-node ComputeTarget, so you can carve out a smaller cluster just for this workload (see the provisioning sketch at the end of this answer).

    On our team, we use pygit2 to have a ComputeTarget created for each feature branch, so that, as data scientists, we can be confident that we're not stepping on our coworkers' toes.
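
    Here is a minimal ParallelRunStep sketch. It assumes a registered FileDataset of inputs to process, a registered environment, and an entry script process_datetime.py that defines init() and run(mini_batch); all of those names are placeholders, not anything from your setup. The node_count argument is what caps how many nodes of the cluster the step may occupy.

from azureml.core import Dataset, Environment, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Placeholders: a registered environment and a FileDataset of items to process
batch_env = Environment.get(ws, name='my-batch-env')
input_ds = Dataset.get_by_name(ws, name='historical-import-files')

output = PipelineData(name='processed', datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory='.',
    entry_script='process_datetime.py',  # placeholder script defining init() and run(mini_batch)
    mini_batch_size='1',                 # one file per run() invocation (FileDataset)
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,
    compute_target=compute_target,       # the existing AmlCompute cluster
    node_count=4,                        # cap: use at most 4 nodes of the cluster
    process_count_per_node=2,
)

parallel_step = ParallelRunStep(
    name='import-historical-data',
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input('dates')],
    output=output,
)

pipeline = Pipeline(workspace=ws, steps=[parallel_step])
Experiment(ws, 'my-pipeline-run').submit(pipeline)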
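
    And here is a sketch of provisioning a smaller, dedicated AmlCompute cluster whose max_nodes acts as the hard cap you're after. The cluster name, VM size, and node counts are illustrative; point your pipeline's steps at this compute target and it can never take more than max_nodes nodes.

from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = 'historical-import'  # placeholder name for the dedicated cluster

if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
else:
    config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_DS3_V2',
        min_nodes=0,          # scale to zero when idle
        max_nodes=10,         # hard cap on nodes this pipeline can use
        idle_seconds_before_scaledown=1200,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)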