Tags: python-3.x, amazon-sagemaker, amazon-sagemaker-studio

How to install requirements in a SageMaker processing step/job within a SageMaker pipeline?


I am setting up a processing job in a SageMaker pipeline, and my project has the following files:

projectA
  - run.py
  - requirements.txt 

I have some dependencies listed in requirements.txt that I need to install before I run my script. I'm not sure how to set up the processing step so that it installs the requirements before it runs my script.

Any thoughts?

from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.steps import ProcessingStep

script_processor = ScriptProcessor(
    instance_type='ml.t3.medium',
    instance_count=1,
    ...
    command=['python'],
)

processing_step = ProcessingStep(
    name='p_step',
    processor=script_processor,
    code='./run.py',
    inputs=[
        ...
    ],
)


Solution

  • Based on my knowledge of SageMaker Pipelines and SageMaker Processing Jobs, there are two ways to manage dependencies: either you build an image with the dependencies baked in and specify it via image_uri when defining the ScriptProcessor object, or you install them during the job runtime.
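
    For the first approach, a minimal sketch could look like the following; the image URI is only a placeholder for an image you have already built with your requirements and pushed to ECR, and role_pipeline is assumed to be your execution role:

    from sagemaker.processing import ScriptProcessor

    script_processor = ScriptProcessor(
        # placeholder URI -- replace with your own image in ECR
        image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/my-processing-image:latest",
        command=["python3"],
        instance_type="ml.t3.medium",
        instance_count=1,
        role=role_pipeline,
    )

    The rest of this answer shows how to do the second approach.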

    I provide the following example (which uses the SKLearnProcessor class for the job):

    1. Define the processing job:

    from sagemaker.sklearn.processing import SKLearnProcessor

    # pipeline_session is your PipelineSession and role_pipeline your execution role
    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1",
        instance_type=instance_type_preprocessing.default_value,
        instance_count=instance_count.default_value,
        role=role_pipeline,
        sagemaker_session=pipeline_session,
    )
    
    2. Define the job step arguments:
    
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.workflow.functions import Join
    import os

    step_args = sklearn_processor.run(
        inputs=[
            ProcessingInput(
                input_name="raw-data",
                source=input_data,
                destination="/opt/ml/processing/input",
            ),
            ProcessingInput(
                input_name="preprocessor",
                source=os.path.join(BASE_DIR, "my_job/requirements/"),
                destination="/opt/ml/processing/input/code/requirements/",
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="train",
                source="/opt/ml/processing/artifacts/ml_modelling/train",
                destination=Join(
                    on="/",
                    values=[
                        ...
                    ],
                ),
            ),
        ],
        code=os.path.join(BASE_DIR, "my_job/run.py"),
    )
    
    

    The second item in inputs should point to the directory containing your requirements.txt file; I recommend you bundle everything together in a my_job directory.
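
    Once you have step_args, you can plug them into the pipeline step and the pipeline itself. This is only a sketch of the wiring; the pipeline name is a placeholder and the parameters list assumes the pipeline parameters you defined elsewhere:

    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.workflow.pipeline import Pipeline

    # Because the processor was created with a PipelineSession, .run() does not
    # start a job immediately; it only builds the arguments consumed below.
    processing_step = ProcessingStep(
        name="p_step",
        step_args=step_args,
    )

    pipeline = Pipeline(
        name="my-pipeline",  # placeholder name
        parameters=[instance_type_preprocessing, instance_count],
        steps=[processing_step],
        sagemaker_session=pipeline_session,
    )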

    3. Then, in your entry point script for the job, put the following before you import the dependencies you require:

    import traceback
    import os
    import sys
    import subprocess

    # Install the bundled requirements with the same interpreter that runs this script.
    subprocess.check_call(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "-r",
            "/opt/ml/processing/input/code/requirements/requirements.txt",
        ]
    )
    

    4. Proceed to add the import statements of your dependencies as normal (see the run.py sketch after the project structure below).

    5. Verify your project structure is as follows:

    my_project
     - my_job/
         - requirements/
              - requirements.txt
         - run.py
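
    To make the ordering in steps 3 and 4 concrete, a stripped-down run.py could look like the sketch below; the pandas import is only a stand-in for whatever your requirements.txt actually lists:

    import subprocess
    import sys

    # Install the bundled requirements before anything below tries to import them.
    subprocess.check_call(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "-r",
            "/opt/ml/processing/input/code/requirements/requirements.txt",
        ]
    )

    # Only now import the freshly installed dependencies.
    import pandas as pd  # noqa: E402 -- stand-in for your actual dependencies


    def main():
        # your processing logic goes here
        ...


    if __name__ == "__main__":
        main()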
    

    Note on this approach: Only do it if you trust what is in the requirements.txt file and you don't want to build the image and push it to ECR.

    Do let me know if this solves your issue and/or if you have questions about the code.