Tags: python, azure, azure-machine-learning-service

How to write to Azure PipelineData properly?


I'm trying to learn Azure, with little luck so far. All the tutorials show PipelineData being used just like a file when it is configured in "upload" mode. However, I'm getting a "FileNotFoundError: [Errno 2] No such file or directory: ''" error. I would love to ask a more specific question, but I just can't see what I'm doing wrong.

from azureml.core import Workspace, Datastore, Dataset, Environment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData
import os

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

compute_name = "cpucluster"
compute_target = ComputeTarget(workspace=ws, name=compute_name)
aml_run_config = RunConfiguration()
aml_run_config.target = compute_target
aml_run_config.environment.python.user_managed_dependencies = False
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas','scikit-learn'], 
    pip_packages=['azureml-sdk', 'azureml-dataprep[fuse,pandas]'], 
    pin_sdk_version=False)

output1 = PipelineData("processed_data1", datastore=datastore, output_mode="upload")
prep_step = PythonScriptStep(
    name="dataprep",
    script_name="dataprep.py",
    source_directory=os.path.join(os.getcwd(),'dataprep'),
    arguments=["--output", output1],
    outputs=[output1],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=True
)
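
The step then goes into a pipeline that I submit roughly like this (the experiment name here is just a placeholder):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Assemble the single step into a pipeline and run it on the compute cluster
pipeline = Pipeline(workspace=ws, steps=[prep_step])
run = Experiment(ws, "dataprep-test").submit(pipeline)
run.wait_for_completion(show_output=True)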

In dataprep.py I have the following:

import argparse
import numpy
import pandas
from azureml.core import Run

run = Run.get_context()

parser = argparse.ArgumentParser()
parser.add_argument('--output', dest='output', required=True)
args = parser.parse_args()

# Build a small random DataFrame and write it to the path passed via --output
df = pandas.DataFrame(numpy.random.rand(100, 3))
df.iloc[:, 2] = df.iloc[:, 0] + df.iloc[:, 1]
print(df.iloc[:5, :])
df.to_csv(args.output)

So pandas is supposed to write the DataFrame to the output, but my compute cluster reports the following:

"User program failed with FileNotFoundError: [Errno 2] No such file or directory: ''\".

When I don't include the to_csv() call, the cluster does not complain.


Solution

  • Here is an example for PRS. PipelineData was intended to represent "transient" data passed from one step to the next, while OutputFileDatasetConfig was intended for capturing the final state of a dataset (hence why you see features like lineage, ADLS support, etc.). PipelineData always outputs data in a folder structure like {run_id}/{output_name}. OutputFileDatasetConfig decouples the data from the run, so it lets you control where the data lands (although by default it produces a similar folder structure). OutputFileDatasetConfig even lets you register the output as a Dataset, where getting rid of such a folder structure makes sense. From the docs themselves: "Represent how to copy the output of a run and be promoted as a FileDataset. The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination."

    OutputFileDatasetConfig is a control plane concept used to pass data between pipeline steps. A sketch of how it could replace the PipelineData in the question is below.
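
    As a minimal sketch (the datastore destination path, the registered dataset name, and the output file name are just placeholders I chose, not anything from the question), the step could be wired with OutputFileDatasetConfig instead of PipelineData like this:

    from azureml.data import OutputFileDatasetConfig

    # Land the output under a chosen path on the datastore instead of the
    # run-scoped {run_id}/{output_name} layout, and register it as a Dataset.
    output1 = (
        OutputFileDatasetConfig(
            name="processed_data1",
            destination=(datastore, "processed_data1/{run-id}"),
        )
        .as_upload()
        .register_on_complete(name="processed_data1")
    )

    prep_step = PythonScriptStep(
        name="dataprep",
        script_name="dataprep.py",
        source_directory=os.path.join(os.getcwd(), "dataprep"),
        arguments=["--output", output1],  # resolves to a local folder on the compute
        compute_target=compute_target,
        runconfig=aml_run_config,
        allow_reuse=True,
    )

    Inside dataprep.py the --output argument then points to a directory rather than a file, so the write becomes something like:

    import os
    os.makedirs(args.output, exist_ok=True)
    df.to_csv(os.path.join(args.output, "processed.csv"))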