Tags: azure-machine-learning-service, great-expectations

How to access the output folder from a PythonScriptStep?


I'm new to azure-ml and have been tasked with writing some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:

from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )

data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )


data_prep_step = PythonScriptStep(
            name='data_prep',
            script_name='pipeline_steps/data_prep.py',
            source_directory='/.',
            arguments=['--main_path', main_ref,
                        '--data_ref_folder', data_ref
                        ],
            inputs=[main_ref, data_ref],
            outputs=[data_ref],
            runconfig=arbitrary_run_config,
            allow_reuse=False
            )

I would like:

  • my data_prep_step to run,
  • have it store some data at the path of my data_ref, and
  • to then access this stored data afterwards, outside of the pipeline.

But I can't find a useful function for this in the documentation. Any guidance would be much appreciated.


Solution

  • two big ideas here -- let's start with the main one.

    main ask

    With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?

    short answer

    Consider using OutputFileDatasetConfig (docs example) instead of DataReference.

    To your example above, I would change your last two definitions (and drop data_ref from inputs, since it is now an output rather than an input).

    from azureml.data import OutputFileDatasetConfig

    data_ref = OutputFileDatasetConfig(
        name='data_ref',
        destination=(ds, '/data')
    ).as_upload()
    
    
    data_prep_step = PythonScriptStep(
        name='data_prep',
        script_name='pipeline_steps/data_prep.py',
        source_directory='/.',
        arguments=[
            '--main_path', main_ref,
            '--data_ref_folder', data_ref
        ],
        inputs=[main_ref],
        outputs=[data_ref],
        runconfig=arbitrary_run_config,
        allow_reuse=False
    )
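
    Then, outside of the pipeline (after the run finishes), you can get at whatever the step wrote by pointing a FileDataset at the same destination. A rough sketch, reusing the 'test_datastore' and '/data' names from your example:

    from azureml.core import Workspace, Datastore, Dataset

    ws = Workspace.from_config('blabla/config.json')
    ds = Datastore.get(ws, datastore_name='test_datastore')

    # point a FileDataset at the destination you gave OutputFileDatasetConfig
    output_ds = Dataset.File.from_files(path=(ds, '/data'))

    # download locally (or use .mount() instead)
    local_paths = output_ds.download(target_path='./data_ref_output', overwrite=True)
    print(local_paths)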
    

    some notes:

    • be sure to check out how DataPaths work. They can be tricky at first glance.
    • set overwrite=False in the .as_upload() method if you don't want future runs to overwrite the first run's data.
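
    For example, a minimal sketch of both notes (the datastore and '/data' path are just the ones from your example):

    from azureml.data import OutputFileDatasetConfig
    from azureml.data.datapath import DataPath

    # a DataPath pins down which datastore and which path on it
    dp = DataPath(datastore=ds, path_on_datastore='data')

    # overwrite=False keeps later runs from clobbering the first run's output
    data_ref = OutputFileDatasetConfig(
        name='data_ref',
        destination=(ds, '/data')
    ).as_upload(overwrite=False)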

    more context

    PipelineData used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:

    1. stitch steps together
    2. get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)

    The downside was that you had no control over where the data was saved. If you wanted the data to be more than just a baton passed between steps, you could add a DataTransferStep to land the PipelineData wherever you please after the PythonScriptStep finishes.
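
    For illustration, the old pattern looked roughly like this (a sketch only; the step and script names are made up, and arbitrary_run_config is the one from your example):

    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import PythonScriptStep

    # lands under datastore/azureml/{run_id}/data_ref; you don't get to pick the path
    intermediate = PipelineData('data_ref', datastore=ds)

    step_a = PythonScriptStep(name='step_a', script_name='a.py',
                              arguments=['--out', intermediate],
                              outputs=[intermediate],
                              runconfig=arbitrary_run_config)

    step_b = PythonScriptStep(name='step_b', script_name='b.py',
                              arguments=['--in', intermediate],
                              inputs=[intermediate],
                              runconfig=arbitrary_run_config)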

    This downside is what motivated OutputFileDatasetConfig.

    auxiliary ask

    How might I programmatically test the functionality of my Azure ML pipeline?

    There are not enough people talking about data pipeline testing, IMHO.

    There are three areas of data pipeline testing:

    1. unit testing (does the code in the step work?)
    2. integration testing (does the code work when submitted to the Azure ML service?)
    3. data expectation testing (does the data coming out of the step meet my expectations?)

    For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions. For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
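
    For example, a CI check along those lines can be as simple as the sketch below (it assumes the data_prep_step and ws defined above, plus an experiment name of your choosing):

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline

    pipeline = Pipeline(workspace=ws, steps=[data_prep_step])
    run = Experiment(ws, 'integration-test').submit(pipeline)

    # raises if any step fails, which is enough to fail the CI job
    run.wait_for_completion(raise_on_error=True)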

    #3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me, you have two options for including expectation tests in your Azure ML pipeline:

    1. within the PythonScriptStep itself, i.e.
      1. run whatever code you have
      2. test the outputs with GE before writing them out; or,
    2. for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.

    Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
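
    As a rough illustration of option #1 (the column names and expectations here are purely hypothetical):

    import great_expectations as ge
    import pandas as pd

    # stand-in for whatever the step actually produced
    df = pd.DataFrame({'id': [1, 2, 3], 'score': [0.2, 0.5, 0.9]})

    # wrap the frame so the expect_* methods become available
    gdf = ge.from_pandas(df)

    # hypothetical expectations; swap in your own
    assert gdf.expect_column_values_to_be_not_null('id').success
    assert gdf.expect_column_values_to_be_between('score', min_value=0, max_value=1).success

    # only write the output (e.g. into the data_ref folder) once the expectations pass
    df.to_csv('out.csv', index=False)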