Search code examples
pythonamazon-web-servicesscikit-learnamazon-sagemaker

How to pass dependency files to sagemaker SKLearnProcessor and use it in Pipeline?


I need to import function from different python scripts, which will used inside preprocessing.py file. I was not able to find a way to pass the dependent files to SKLearnProcessor Object, due to which I am getting ModuleNotFoundError.

Code:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)


sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )

I would like to pass, dependent_files = ['file1.py', 'file2.py', 'requirements.txt']. So, that preprocessing.py have access to all the dependent modules.

And also need to install libraries from requirements.txt file.

Can you share any work around or a right way to do this?

Update-25-11-2021:

Q1.(Answered but looking to solve using FrameworkProcessor)

Here, the get_run_args function, is handling dependencies, source_dir and code parameters by using FrameworkProcessor. Is there any way that we can set this parameters from ScriptProcessor or SKLearnProcessor or any other Processor to set them?

Q2.

Can you also please show some reference to use our Processor as sagemaker.workflow.steps.ProcessingStep and then use in sagemaker.workflow.pipeline.Pipeline?

For having Pipeline, do we need sagemaker-project as mandatory or can we create Pipeline directly without any Sagemaker-Project?


Solution

  • There are a couple of options for you to accomplish that.

    One that is really simple is adding all additional files to a folder, example:

    .
    ├── my_package
    │   ├── file1.py
    │   ├── file2.py
    │   └── requirements.txt
    └── preprocessing.py
    

    Then send this entire folder as another input under the same /opt/ml/processing/input/code/, example:

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    
    sklearn_processor = SKLearnProcessor(
        framework_version="0.20.0",
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )
    
    sklearn_processor.run(
        code="preprocessing.py",  # <- this gets uploaded as /opt/ml/processing/input/code/preprocessing.py
        inputs=[
            ProcessingInput(source=input_data, destination='/opt/ml/processing/input'),
            # Send my_package as /opt/ml/processing/input/code/my_package/
            ProcessingInput(source='my_package/', destination="/opt/ml/processing/input/code/my_package/")
        ],
        outputs=[
            ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
        ],
        arguments=["--train-test-split-ratio", "0.2"],
    )
    

    What happens is that sagemaker-python-sdk is going to put your argument code="preprocessing.py" under /opt/ml/processing/input/code/ and you will have my_package/ under the same directory.

    Edit:

    For the requirements.txt, you can add to your preprocessing.py:

    import sys
    import subprocess
    
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", "-r",
        "/opt/ml/processing/input/code/my_package/requirements.txt",
    ])