
AzureML ParallelRunStep runs only on one node


I have an inference pipeline made up of several PythonScriptStep steps with a ParallelRunStep in the middle. Everything works, except that all mini-batches run on a single node during the ParallelRunStep, no matter how many nodes I request through the node_count config argument.

All the nodes seem to be up and running in the cluster, and according to the logs the init() function has been run on them multiple times. Digging into the logs, I can see in sys/error/10.0.0.* that every worker except the one doing the work reports:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/batch/tasks/shared/LS_root/jobs/virtualstage/azureml/c36eb050-adc9-4c34-8a33-5f6d42dcb19c/wd/tmp8_txakpm/bg.png'

bg.png happens to be a side input created in a previous PythonScriptStep that I'm passing to the ParallelRunStep:

bg_file = PipelineData('bg', datastore=data_store)
bg_file_ds = bg_file.as_dataset()
bg_file_named = bg_file_ds.as_named_input("bg")
bg_file_dw = bg_file_named.as_download()

...

parallelrun_step = ParallelRunStep(
    name='batch-inference',
    parallel_run_config=parallel_run_config,
    inputs=[frames_data_named.as_download()],
    arguments=["--bg_folder", bg_file_dw],
    side_inputs=[bg_file_dw],
    output=inference_frames_ds,
    allow_reuse=True
)

What's happening here? Why does the side input seem to be available on only one worker while it fails on the others?

BTW I found this similar but unresolved question.

Any help is much appreciated, thanks!


Solution

  • Apparently, to use side_inputs on more than one node you need to mount the dataset at an explicit local path instead of downloading it:

    import uuid

    # Mount the side input at a unique local path so every node can resolve it.
    bg_file_named = bg_file_ds.as_named_input("bg")
    bg_file_mnt = bg_file_named.as_mount(f"/tmp/{uuid.uuid4()}")
    
    ...
    
    parallelrun_step = ParallelRunStep(
        name='batch-inference',
        parallel_run_config=parallel_run_config,
        inputs=[frames_data_named.as_download()],
        arguments=["--bg_folder", bg_file_mnt],
        side_inputs=[bg_file_mnt],
        output=inference_frames_ds,
        allow_reuse=True
    )
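
To check that the side input really is visible on every node after this change, the entry script can log it from init(). The following is only a minimal sketch, not the original poster's script: the helper names parse_bg_folder and describe_side_input are hypothetical, and the bg.png filename is taken from the error message in the question. It assumes the --bg_folder argument reaches the entry script on the command line alongside other arguments, which is why it uses parse_known_args.

```python
import argparse
import os
import socket


def parse_bg_folder(argv=None):
    """Return the value passed via --bg_folder, ignoring any other arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bg_folder", required=True)
    # parse_known_args tolerates extra arguments that this script does not define.
    args, _ = parser.parse_known_args(argv)
    return args.bg_folder


def describe_side_input(bg_folder, filename="bg.png"):
    """Build a one-line report on whether the side input is visible on this node."""
    path = os.path.join(bg_folder, filename)
    status = "found" if os.path.exists(path) else "MISSING"
    return f"[{socket.gethostname()}] side input {path}: {status}"


def init():
    # Runs once per worker process; the printed line ends up in that worker's log.
    global bg_folder
    bg_folder = parse_bg_folder()
    print(describe_side_input(bg_folder))


def run(mini_batch):
    # Placeholder: a real script would process each item using files in bg_folder.
    return [str(item) for item in mini_batch]
```

If a node logs MISSING, the side input was not materialized at that path on that node, which would match the FileNotFoundError seen in the question.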
    
