
Serialise objects in azure ML pipeline runs


I have a Python package for pre-processing data for training and scoring/inference. I use it as a Python step in a pipeline. The entry script (which is in the package) takes a task argument with choices=(train, score) and performs the pre-processing accordingly. Here is the step code:

# Pipeline parameters: task and config_path
param_task = PipelineParameter(name='task', default_value='train')
param_config_path = PipelineParameter(name="config_path", default_value='Preprocess/preprocess_config.json')


# Define pipeline steps
StepPreprocessing = PythonScriptStep(
    name="Preprocessing",
    script_name=e.preprocess_script_path,
    arguments=[
        "--config_path", param_config_path, 
        "--task", param_task,
    ], 
    inputs=None,
    compute_target=aml_compute,
    runconfig=run_config,
    source_directory=e.sources_directory,
    allow_reuse=False
)

With task=='train' it loads the data and pre-processes it according to the steps listed in a config file. During this process it creates StandardScaler and SimpleImputer objects (scikit-learn objects), stores those objects in a data/output folder inside the package, and writes the processed data to Azure storage.

The problem is that when the pipeline is run again with task=='score', it cannot find the scikit-learn objects and fails with:

User program failed with FileNotFoundError: [Errno 2] No such file or directory: 'data/output/StandardScaler.joblib'

What is the best way to save the scikit-learn objects so that the pipeline can access them when it is run again with task=='score'?

I don't want to register these objects in the model registry, and I don't want to save them in datastores either.


Solution

  • There are two ways to do this:

    1. Register the artifacts in the model registry during training and retrieve them from the registry when scoring.

    2. Configure the output of the pre-processing step as PipelineData or OutputFileDatasetConfig and write the artifacts to that output. When scoring, look up the run of the training pipeline (this requires the experiment name), get its outputs, and download the artifacts from there.
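Option 2 can be sketched as follows. The AzureML wiring (shown in comments, since it needs a live workspace) passes the mount path of a PipelineData output into the script as an argument; the entry script then simply writes and reads artifacts under that path. Names like `artifacts_dir`, `preprocess_artifacts`, and the `Scaler` class are illustrative stand-ins, and pickle is used instead of joblib only to keep the sketch self-contained:

```python
import os
import pickle
import tempfile

# --- AzureML side (sketch only; requires a workspace `ws`) ---
# from azureml.pipeline.core import PipelineData
# artifacts = PipelineData("preprocess_artifacts",
#                          datastore=ws.get_default_datastore())
# StepPreprocessing = PythonScriptStep(
#     ...,
#     arguments=["--task", param_task, "--artifacts_dir", artifacts],
#     outputs=[artifacts],
# )
# At scoring time, locate the training pipeline's run via its
# experiment name, then download the step output, e.g.:
# step_run.get_output_data("preprocess_artifacts").download(local_path)

# --- Entry-script side: write on 'train', read back on 'score' ---
class Scaler:
    """Stand-in for a fitted sklearn StandardScaler."""
    def __init__(self, mean):
        self.mean = mean

def run_task(task, artifacts_dir):
    # AzureML mounts the configured output at artifacts_dir;
    # joblib.dump/load would work the same way as pickle here.
    path = os.path.join(artifacts_dir, "StandardScaler.pkl")
    if task == "train":
        os.makedirs(artifacts_dir, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(Scaler(mean=3.5), f)
    with open(path, "rb") as f:
        return pickle.load(f)

# Simulate two pipeline runs sharing the same output location.
with tempfile.TemporaryDirectory() as d:
    run_task("train", d)           # first run: fit and persist
    scaler = run_task("score", d)  # second run: reload artifacts
    print(scaler.mean)  # 3.5
```

The key point is that the script never hardcodes a relative path like `data/output`; it only ever uses the directory the pipeline hands it, so both the train and score runs resolve to the same storage location.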