
Move data from one component to the next in Azure Machine Learning


I have two components in Azure Machine Learning. The first component (called prep) produces two dataframes that I want to pass to the next component (called middle) for further processing.

In the prep code, I have tried saving the dataframe to the component's outputs folder, to a datastore, and to the location passed in through the output argument, as shown below:

print((Path(args.Y_df) / "Y_df.csv"))
df1.to_csv("./outputs/Y_df.csv")
df1.to_csv(args.Y_df.path)
df1.to_csv("azureml://subscriptions/subscription_id/resourcegroups/rg_group/workspaces/workspace_name/datastores/datastore_name/paths/azureml/forecast/testing/y_df.csv")

Of these, only the first method works. Next, I want to pass this output to the next component, so I wrote the following in the pipeline definition:

def data_pipeline(
    compute_train_node: str,
):

    prep_node = prep()
    transform_node = middle(Y_df=prep_node.outputs.Y_df,
                            S_df=prep_node.outputs.S_df)

The middle component only runs some basic code, but it never starts. It fails with an error (posted as a screenshot, not reproduced here).

Below are the YAMLs for prep and middle.

middle:

name: middle4
display_name: middle4

inputs:
  Y_df:
    type: uri_file
  S_df:
    type: uri_file

code: ./middle

environment: azureml:environment_name:4

command: >-
  python middle_script.py
  --Y_df ${{inputs.Y_df}}
  --S_df ${{inputs.S_df}}

prep:

name: preprocessing24
display_name: preprocessing24

outputs:
  Y_df:
    type: uri_file

  S_df:
    type: uri_file

code: ./preprocessing

environment: azureml:environment_name:4

command: >-
  python preprocessing_script.py
  --Y_df ${{outputs.Y_df}} 
  --S_df ${{outputs.S_df}}

What am I doing wrong? How do I pass a file from one component to the other?

Edit after trying out the method in the answer:

At the moment, args.Y_df points to some random (probably default) file path rather than the one I passed to the Output() function as described in the answer. It then fails with this error:

OSError: Cannot save file into a non-existent directory: '/mnt/azureml/cr/j/32h438dshj537dj284ndhs630e1/cap/data-capability/wd/Y_df/testing'

Below is the code I wrote to read the paths in the prep script. These paths are used to save the dataframes as CSV files.

import argparse

parser = argparse.ArgumentParser("prep")
parser.add_argument("--Y_df", type=str, help="Path of prepped data")
parser.add_argument("--S_df", type=str, help="Path of prepped data")
parser.add_argument("--clinical_actuals_path", type=str, help="Path of prepped data")
args = parser.parse_args()
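
One likely contributor to the OSError above is that pandas' to_csv does not create missing parent folders. A minimal sketch of a workaround follows, assuming the output is declared as a folder (uri_folder) so that args.Y_df is a directory path. The helper name output_file and the file name Y_df.csv are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def output_file(output_dir: str, filename: str) -> Path:
    """Return a file path inside output_dir, creating the folder chain first.

    Hypothetical helper: output_dir is assumed to be the path that Azure ML
    passes in for a folder-type output (for example args.Y_df).
    """
    folder = Path(output_dir)
    # to_csv will not create missing directories, so create them up front.
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


# Hypothetical usage inside the prep script:
#   df1.to_csv(output_file(args.Y_df, "Y_df.csv"), index=False)
```

The same call works whether or not the folder already exists, because exist_ok=True makes the directory creation idempotent.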

Solution

  • This answer builds on the information JayashankarGS provided above. His method solved almost the entire issue; I only added one extra parameter to the code he provided.

    from azure.ai.ml import MLClient, Input, Output
    
    def data_pipeline(
        compute_train_node: str,
    ):
        prep_node = prep()
    
        # Write each prep output to a fixed datastore folder, mounted read-write.
        prep_node.outputs.Y_df = Output(type="uri_folder", mode='rw_mount', path="azureml://datastores/<datastore_name>/paths/csvs/Y_df/")
        prep_node.outputs.S_df = Output(type="uri_folder", mode='rw_mount', path="azureml://datastores/<datastore_name>/paths/csvs/S_df/")
    
        # Feed the prep outputs into the middle component.
        transform_node = middle(Y_df=prep_node.outputs.Y_df,
                                S_df=prep_node.outputs.S_df)
    

    This is the same code that JayashankarGS posted; I only added one more parameter to the Output() call:

    mode = 'rw_mount'
    

    This solved all the issues.
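
Because the outputs above are declared as uri_folder, the middle component would receive a folder path rather than a file path. A minimal sketch of how the middle script might locate the CSV inside that folder is shown below. The helper resolve_csv and the file names Y_df.csv and S_df.csv are assumptions for illustration; they are not part of the original answer.

```python
import argparse
from pathlib import Path


def resolve_csv(input_path: str, filename: str) -> Path:
    """Return the CSV path whether the input arrives as a folder or a file."""
    path = Path(input_path)
    # A folder input is a directory containing the file; a file input
    # is already the file itself, so it is returned unchanged.
    return path / filename if path.is_dir() else path


def parse_args(argv=None):
    """Parse the two input paths passed on the component's command line."""
    parser = argparse.ArgumentParser("middle")
    parser.add_argument("--Y_df", type=str, help="Path of the Y_df input")
    parser.add_argument("--S_df", type=str, help="Path of the S_df input")
    return parser.parse_args(argv)


# Hypothetical usage inside middle_script.py:
#   args = parse_args()
#   y_df = pd.read_csv(resolve_csv(args.Y_df, "Y_df.csv"))
#   s_df = pd.read_csv(resolve_csv(args.S_df, "S_df.csv"))
```

Handling both cases keeps the script usable if the component YAML still declares the inputs as uri_file.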