azure, azure-machine-learning-service, azure-sdk-python, azureml-python-sdk

Use a registered dataset in pipelines in AML


I was following the SDK v2 Python tutorial in order to create a pipeline job with my own assets. I noticed that in this tutorial they let you use a CSV file that can be downloaded, but I'm trying to use a dataset that I already registered on my own. The problem I'm facing is that I don't know where I need to specify the dataset.

The funny part is that at the beginning they create this dataset like this:

credit_data = ml_client.data.create_or_update(credit_data)
print(
    f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
)

But the only place where they refer to this dataset is at the end, where they comment out the line:

registered_model_name = "credit_defaults_model"

# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    # pipeline_job_data_input=credit_data,
    pipeline_job_data_input=Input(type="uri_file", path=web_path),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)

To me this means that I can use the data like this (an already registered dataset); the problem is that I don't know where I need to make the changes (I know it's in data_prep.py and in the code below, but I don't know where else) and I don't know how to set this:

%%writefile {data_prep_src_dir}/data_prep.py
...

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data") # <=== Here, but I don't know how
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()

...

Does anyone have experience working with registered datasets?


Solution

  • parser.add_argument("--data", type=str, help="path to input data") # <=== Here, but I don't know how

    To get the path to the input data, according to the documentation:

    • You can pass the dataset by ID via --input-data and access it in your training script.

    • Alternatively, pass the mounted input path as a command-line argument (mounted_input_path).

    For example, try the following three code snippets taken from the documentation:

    Access dataset in training script:

    import argparse

    from azureml.core import Dataset, Run

    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str)
    args = parser.parse_args()

    run = Run.get_context()
    ws = run.experiment.workspace

    # get the input dataset by ID
    dataset = Dataset.get_by_id(ws, id=args.input_data)
    

    Configure the training run:

    from azureml.core import ScriptRunConfig

    src = ScriptRunConfig(source_directory=script_folder,
                          script='train_titanic.py',
                          # pass dataset as an input with friendly name 'titanic'
                          arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                          compute_target=compute_target,
                          environment=myenv)
    

    Pass mounted_input_path as argument:

    import sys

    mounted_input_path = sys.argv[1]
    mounted_output_path = sys.argv[2]

    print("Argument 1: %s" % mounted_input_path)
    print("Argument 2: %s" % mounted_output_path)
    
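    On the script side, once the dataset is mounted, the data argument is just an ordinary file path. As a minimal sketch (the `load_rows` helper name is my own, and I use the stdlib `csv` module rather than pandas to keep it self-contained), `data_prep.py` could consume it like this:

```python
import argparse
import csv


def load_rows(argv=None):
    """Parse the --data argument and read the mounted CSV file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    args = parser.parse_args(argv)

    # args.data is an ordinary file path once Azure ML mounts the dataset
    with open(args.data, newline="") as f:
        return list(csv.DictReader(f))
```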

    References: How to create and register datasets and How to configure a training run with data input and output
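Note that the snippets above use the v1 (azureml-core) APIs, while the tutorial in the question is SDK v2 (azure-ai-ml). In v2, a registered data asset can also be referenced by an azureml:&lt;name&gt;:&lt;version&gt; URI when instantiating the pipeline. A minimal sketch, assuming a data asset named credit_data with version 1 (both names are my assumptions):

```python
def data_asset_uri(name: str, version: str) -> str:
    """Short-hand URI that Azure ML resolves to a registered data asset."""
    return f"azureml:{name}:{version}"


def registered_data_input(name: str, version: str):
    """Build an SDK v2 Input that points at a registered data asset."""
    # deferred import so the sketch can be loaded without azure-ai-ml installed
    from azure.ai.ml import Input
    return Input(type="uri_file", path=data_asset_uri(name, version))


# In the tutorial's pipeline instantiation, this would replace the web_path input:
# pipeline = credit_defaults_pipeline(
#     pipeline_job_data_input=registered_data_input("credit_data", "1"),
#     ...
# )
```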