Tags: python, pipeline, azure-databricks, azure-machine-learning-service

Can't find scoring.py when using PythonScriptStep() in Databricks


We are defining a PythonScriptStep() in Databricks. When running our pipeline script, the scoring.py file can't be found.

import os
from azureml.pipeline.steps import PythonScriptStep

# ds_consumption, pipeline_cluster and pipeline_run_config are defined
# earlier in our pipeline script
scoring_step = PythonScriptStep(
    name="Scoring_Step",
    source_directory=os.getenv("DATABRICKS_NOTEBOOK_PATH", "/Users/USER_NAME/source_directory"),
    script_name="./scoring.py",
    arguments=["--input_dataset", ds_consumption],
    compute_target=pipeline_cluster,
    runconfig=pipeline_run_config,
    allow_reuse=False)

We are getting the following error message:

Step [Scoring_Step]: script not found at: /databricks/driver/scoring.py. Make sure to specify an appropriate source_directory on the Step or default_source_directory on the Pipeline.

For some reason Databricks is searching for the file in '/databricks/driver/' instead of the folder we specified.
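We can confirm this with a quick check (a minimal diagnostic sketch, using the same path as in our step definition): the configured source_directory does not exist on the driver's local filesystem.

import os

src = os.getenv("DATABRICKS_NOTEBOOK_PATH", "/Users/USER_NAME/source_directory")

# If this prints False, the SDK cannot snapshot the folder: workspace paths
# like /Users/... are generally not visible on the driver's local filesystem.
print(os.path.isdir(src), os.path.isfile(os.path.join(src, "scoring.py")))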

There is also the option of using DatabricksStep() instead of PythonScriptStep(), but for specific reasons we need to use the PythonScriptStep() class.

Could anybody help us with this specific problem?

Thank you very much for any help!


Solution

Replace the PythonScriptStep definition above with the code below; it resolves the error. The key change is that source_directory is a local, relative path the SDK can snapshot and upload, and script_name is given relative to that directory:

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# 'ds' is the target Datastore; 'main_ref' and 'arbitrary_run_config' come
# from the rest of the pipeline setup.
data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='.',  # a local directory the SDK can snapshot
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
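
For completeness, a minimal sketch of what pipeline_steps/data_prep.py could look like; the argument names match the step definition above, while the processing body is a hypothetical placeholder:

# pipeline_steps/data_prep.py -- hypothetical sketch; only the argument
# names come from the step definition above.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--main_path', type=str)
parser.add_argument('--data_ref_folder', type=str)
args = parser.parse_args()

# An OutputFileDatasetConfig resolves to a folder path at run time; files
# written under it are uploaded to the datastore destination after the step.
os.makedirs(args.data_ref_folder, exist_ok=True)
with open(os.path.join(args.data_ref_folder, 'prepared.txt'), 'w') as f:
    f.write('prepared data placeholder')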
    

Reference: see the Azure Machine Learning pipelines documentation.
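
As a usage note, here is a hedged sketch of wiring the step into a pipeline and submitting it (assuming ws is your Workspace and the experiment name is arbitrary):

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()  # assumes a local config.json for the workspace

pipeline = Pipeline(workspace=ws, steps=[data_prep_step])
run = Experiment(ws, 'data-prep-pipeline').submit(pipeline)
run.wait_for_completion(show_output=True)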