I was following the SDK v2 Python tutorial to create a pipeline job with my own assets. I noticed that in this tutorial they let you use a CSV file that can be downloaded, but I'm trying to use a dataset that I already registered myself. The problem I'm facing is that I don't know where I need to specify the dataset.
The funny part is that at the beginning they create this dataset like this:
credit_data = ml_client.data.create_or_update(credit_data)
print(
    f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
)
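For context, just before that call the tutorial builds credit_data as a Data asset, roughly like this (reproduced from memory, so the exact asset name and metadata may differ):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# web_path is the tutorial's download URL for the credit card data
credit_data = Data(
    name="credit_card_data",   # the tutorial's actual asset name may differ
    path=web_path,
    type=AssetTypes.URI_FILE,
    description="Dataset for credit card defaults",
    version="1.0.0",
)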
But the only other place where they refer to this dataset is at the end, where the line is commented out:
registered_model_name = "credit_defaults_model"
# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    # pipeline_job_data_input=credit_data,
    pipeline_job_data_input=Input(type="uri_file", path=web_path),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)
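My guess is that I could swap the commented line back in and point it at my registered asset, either by fetching it with ml_client.data.get() or via its azureml:<name>:<version> URI; something like this sketch (the asset name and version are placeholders for my own):

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# placeholders: replace with the name/version of your own registered asset
my_data = ml_client.data.get(name="my_registered_dataset", version="1")

pipeline = credit_defaults_pipeline(
    # the asset id works as a path; "azureml:my_registered_dataset:1" should too
    pipeline_job_data_input=Input(type=AssetTypes.URI_FILE, path=my_data.id),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)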
So it seems I can use an already registered dataset here, but I'm not sure my sketch above is right. I also don't know all the places where changes are needed (I know data_prep.py and the code below are involved, but I don't know where else), and I don't know how to set this:
%%writefile {data_prep_src_dir}/data_prep.py
...
def main():
    """Main function of the script."""
    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")  # <=== Here, but I don't know how
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()
...
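My understanding is that args.data arrives as a local path at runtime (the pipeline mounts or downloads whatever the Input points at), so the read inside the script would be the same either way, e.g.:

import pandas as pd

# args.data is materialized as a local path at runtime, whether the
# Input pointed at a web URL or at a registered data asset
df = pd.read_csv(args.data)
print(f"read {df.shape[0]} rows from {args.data}")

But I can't confirm this.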
Does anyone have experience working with registered datasets?
parser.add_argument("--data", type=str, help="path to input data") # <=== Here, but I don´t know how
To get the path to input data, according to documentation:
You can get --input-data
by ID which you can access in your training script.
Use it as argument
on mounted_input_path
For example, try the following three code snippets taken from the documentation:
Access dataset in training script:
import argparse
from azureml.core import Dataset, Run

parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
args = parser.parse_args()

run = Run.get_context()
ws = run.experiment.workspace
# get the input dataset by ID
dataset = Dataset.get_by_id(ws, id=args.input_data)
Configure the training run:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_titanic.py',
                      # pass dataset as an input with friendly name 'titanic'
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=myenv)
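To actually launch it, you submit that config to an experiment as usual in the v1 SDK (the experiment name here is arbitrary):

from azureml.core import Experiment

run = Experiment(workspace=ws, name="train-titanic").submit(src)
run.wait_for_completion(show_output=True)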
Pass mounted_input_path as an argument:
import sys

mounted_input_path = sys.argv[1]
mounted_output_path = sys.argv[2]
print("Argument 1: %s" % mounted_input_path)
print("Argument 2: %s" % mounted_output_path)
References: How to create and register datasets and How to configure a training run with data input and output.