I'm new to azure-ml, and have been tasked with writing some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a `test_datastore`. The following example code is a simplified version of what I want to do:
```python
from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

# reference to the root of the test datastore
main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )

# reference to the /data folder, where the step should write its output
data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref
               ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,  # defined elsewhere
    allow_reuse=False
)
```
I would like:

1. `data_prep_step` to run,
2. have its output data saved to `data_ref` (i.e. to `/data` on the `test_datastore`), and
3. to be able to access that output data after the pipeline has finished, so I can compare it against the expected output.

But I can't find a useful function in the documentation. Any guidance would be much appreciated.
Two big ideas here -- let's start with the main one.
> With an Azure ML Pipeline, how can I access the output data of a `PythonScriptStep` outside of the context of the pipeline?
Consider using `OutputFileDatasetConfig` (docs example) instead of `DataReference`.
To your example above, I would just change your last two definitions.
```python
from azureml.data import OutputFileDatasetConfig

# output dataset that uploads the step's results to /data on test_datastore
data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='.',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref],   # data_ref is an output now, so it only belongs in outputs
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
```
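Then, to your question of getting at the output from outside the pipeline: once the run completes, the files are sitting at `/data` on `test_datastore`, so you can read them back with a `Dataset`. A minimal sketch (the local `actual_output/` folder is just illustrative):

```python
# read the step's output back outside the pipeline, assuming the run has
# completed: .as_upload() above landed the files at /data on test_datastore,
# so we can point a file Dataset at that same path
from azureml.core import Dataset

actual_output = Dataset.File.from_files(path=(ds, '/data'))
local_files = actual_output.download(target_path='actual_output', overwrite=True)
# ...compare local_files against your expected output here
```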
Some notes:

- Check out how `DataPath`s work. They can be tricky at first glance.
- Set `overwrite=False` in the `.as_upload()` method if you don't want future runs to overwrite the first run's data.
- `PipelineData` used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:
  1. stitch steps together, and
  2. get the data after the pipeline runs if need be (it lands at `datastore/azureml/{run_id}/data_ref`).

  The downside was that you have no control over where the data is saved. If you wanted the data for more than just a baton that gets passed between steps, you could add a `DataTransferStep` to land the `PipelineData` wherever you please after the `PythonScriptStep` finishes. This downside is what motivated `OutputFileDatasetConfig`.
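For contrast, here's roughly what the legacy `PipelineData` version of your step would look like (a sketch only, reusing the definitions above -- not what I'd recommend for your use case):

```python
# the legacy pattern, for contrast: PipelineData lands the output under
# datastore/azureml/{run_id}/data_ref, and you don't get to choose the path
from azureml.pipeline.core import PipelineData

data_ref_legacy = PipelineData('data_ref', datastore=ds)

data_prep_step_legacy = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='.',
    arguments=['--main_path', main_ref, '--data_ref_folder', data_ref_legacy],
    inputs=[main_ref],
    outputs=[data_ref_legacy],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
```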
> how might I programmatically test the functionality of my Azure ML pipeline?
There are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:

1. unit testing (the code in each step works as intended),
2. integration testing (the steps work when stitched together), and
3. data expectation testing (the data coming out of a step looks the way you expect).

For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions. For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI; a minimal sketch follows.
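For example, a minimal CI-style check might just be (the experiment name `'integration_test'` is made up here):

```python
# minimal CI-style integration check: submit the pipeline and fail loudly
# if any step errors out
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[data_prep_step])
run = Experiment(ws, 'integration_test').submit(pipeline)
run.wait_for_completion(show_output=True, raise_on_error=True)
```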
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me, you have two options for including expectation tests in your Azure ML pipeline:

1. within the `PythonScriptStep` itself, i.e. run the expectations at the end of the step's script, right after the business logic (a rough sketch is below), or
2. for each functional `PythonScriptStep`, hang a downstream `PythonScriptStep` off of it in which you run your expectations against the output data.

Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
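Here's a rough sketch of option #1, using GE's classic pandas API; the file name, column names, and bounds are all made up for illustration:

```python
# expectation tests at the end of a step's script, after the business logic
import great_expectations as ge
import pandas as pd

df = pd.read_csv('prepped_data.csv')  # however your step materializes its output
ge_df = ge.from_pandas(df)

results = [
    ge_df.expect_column_to_exist('id'),
    ge_df.expect_column_values_to_not_be_null('id'),
    ge_df.expect_column_values_to_be_between('age', min_value=0, max_value=120),
]
assert all(r.success for r in results), 'expectation tests failed'
```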