Tags: python, azure-machine-learning-service, azureml-python-sdk, azuremlsdk

ModuleNotFoundError while using AzureML pipeline with yml file based RunConfiguration and environment.yml


I am running into a ModuleNotFoundError for pandas while using the following code to orchestrate my Azure Machine Learning Pipeline:

# WORKING_DIR, workspace and compute_target are defined elsewhere
# in the orchestration script
import os

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Loading run config
print("Loading run config")
task_1_run_config = RunConfiguration.load(
    os.path.join(WORKING_DIR, 'pipeline', 'task_runconfigs', 'T01_Test_Task.yml')
)

task_1_script_run_config = ScriptRunConfig(
    source_directory=os.path.join(WORKING_DIR, 'pipeline', 'task_scripts'),
    run_config=task_1_run_config
)

task_1_py_script_step = PythonScriptStep(
    name='Task_1_Step',
    script_name=task_1_script_run_config.script,
    source_directory=task_1_script_run_config.source_directory,
    compute_target=compute_target
)

pipeline_run_config = Pipeline(workspace=workspace, steps=[task_1_py_script_step])#, task_2])

pipeline_run = Experiment(workspace, 'Test_Run_New_Pipeline').submit(pipeline_run_config)
pipeline_run.wait_for_completion()

The environment.yml looks like this:

name: phinmo_pipeline_env
dependencies:
- python=3.8
- pip:
  - pandas
  - azureml-core==1.43.0
  - azureml-sdk
  - scipy
  - scikit-learn
  - numpy
  - pyyaml==6.0
  - datetime
  - azure
channels:
  - conda-forge

The loaded RunConfiguration in T01_Test_Task.yml looks like this:

# The script to run.
script: T01_Test_Task.py
# The arguments to the script file.
arguments: [
  "--test", False,
  "--date", "2022-07-26"
]
# The name of the compute target to use for this run.
compute_target: phinmo-compute-cluster
# Framework to execute inside. Allowed values are "Python", "PySpark", "CNTK", "TensorFlow", and "PyTorch".
framework: Python
# Maximum allowed duration for the run.
maxRunDurationSeconds: 6000
# Number of nodes to use for running job.
nodeCount: 1

#Environment details.
environment:
  # Environment name
  name: phinmo_pipeline_env
  # Environment version
  version:
  # Environment variables set for the run.
  #environmentVariables:
  #  EXAMPLE_ENV_VAR: EXAMPLE_VALUE
  # Python details
  python:
    # user_managed_dependencies=True indicates that the environment will be
    # user managed. False indicates that AzureML will manage the user environment.
    userManagedDependencies: false
    # The python interpreter path
    interpreterPath: python
    # Path to the conda dependencies file to use for this run. If a project
    # contains multiple programs with different sets of dependencies, it may be
    # convenient to manage those environments with separate files.
    condaDependenciesFile: environment.yml
    # The base conda environment used for incremental environment creation.
    baseCondaEnvironment: AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
  # Docker details
  
# History details.
history:
  # Enable history tracking -- this allows status, logs, metrics, and outputs
  # to be collected for a run.
  outputCollection: true
  # Whether to take snapshots for history.
  snapshotProject: true
  # Directories to sync with FileWatcher.
  directoriesToWatch:
  - logs
# data reference configuration details
dataReferences: {}
# The configuration details for data.
data: {}
# Project share datastore reference.
sourceDirectoryDataStore:

I have already tried a few things: overwriting the environment attribute of the RunConfiguration object with an environment.python.conda_dependencies object, pinning pandas to a specific version in the environment.yml, and changing the location of the environment.yml. But I am at a loss as to what else to try. T01_Test_Task.py runs without issues on its own; it only fails once it is put into a pipeline.
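For reference, the conda-dependencies workaround mentioned above looked roughly like this (a sketch only; WORKING_DIR and the file layout are taken from my project, and this did not fix the error for me):

```python
import os

from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration

# WORKING_DIR is assumed to be the project root, as elsewhere in this post
run_config = RunConfiguration.load(
    os.path.join(WORKING_DIR, 'pipeline', 'task_runconfigs', 'T01_Test_Task.yml')
)

# Replace whatever environment the YAML resolved to with the dependencies
# parsed straight from environment.yml
run_config.environment.python.conda_dependencies = CondaDependencies(
    conda_dependencies_file_path=os.path.join(WORKING_DIR, 'environment.yml')
)
```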


Solution

  • Okay, I found the issue. I was unnecessarily using ScriptRunConfig, which overwrites the assigned environment with a default AzureML environment. I could only see that in the task description in the Azure Machine Learning Studio UI.

    I was able to just remove that part and now it works:

    task_1_run_config = RunConfiguration.load(
        os.path.join(WORKING_DIR, 'pipeline', 'task_runconfigs', 'T01_Test_Task.yml')
    )
    task_1_py_script_step = PythonScriptStep(
        name='Task_1_Step',
        script_name='T01_Test_Task.py',
        source_directory=os.path.join(WORKING_DIR, 'pipeline', 'task_scripts'),
        runconfig=task_1_run_config,
        compute_target=compute_target
    )