amazon-web-services, amazon-sagemaker, mlops

What are SageMaker pipelines actually?


SageMaker Pipelines are rather unclear to me. I'm not experienced in the field of ML, but I'm working on figuring out the pipeline definitions.

I have a few questions:

  • Is SageMaker Pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.

  • Is a SageMaker pipeline essentially CodePipeline? How do they integrate, and how do they differ?

  • There's also a Python SDK. How does this differ from the CDK and CloudFormation?

I can't seem to find any examples besides the Python SDK usage. How come?

The docs and workshops seem to properly describe only the Python SDK usage; it would be really helpful if someone could clear this up for me!


Solution

  • SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former.

    A model building pipeline defines steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints.

    A serial inference pipeline is two or more SageMaker models run one after the other.
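
    For contrast, a serial inference pipeline is built with the SDK's PipelineModel. A minimal sketch, assuming the two model objects already exist and with placeholder names/role:

    from sagemaker.pipeline import PipelineModel

    # preprocess_model and xgb_model are assumed to be existing sagemaker.Model
    # objects, e.g. an SKLearn feature transformer followed by an XGBoost model
    serial_pipeline = PipelineModel(
        name="inference-pipeline",
        role="role-arn",
        models=[preprocess_model, xgb_model],
    )

    # Deploys a single endpoint; each request flows through the models in order
    serial_pipeline.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
        endpoint_name="inference-pipeline-endpoint",
    )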

    A model building pipeline is defined in JSON, and is hosted/run in some sort of proprietary, serverless fashion by SageMaker.

    Is SageMaker Pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.

    You can create/modify them using the API, which can also be called via the CLI, Python SDK, or CloudFormation. These all use the AWS API under the hood.
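
    For example, the same API can be called directly with boto3 (a minimal sketch; the pipeline name, definition file, and role ARN are placeholders):

    import boto3

    sm = boto3.client("sagemaker")

    # List the model building pipelines in the current account/region
    for summary in sm.list_pipelines()["PipelineSummaries"]:
        print(summary["PipelineName"])

    # Register or update a pipeline straight from a JSON definition string
    # (placeholder name/role; the definition is a document like the one further down)
    definition_json = open("pipeline-definition.json").read()
    sm.create_pipeline(
        PipelineName="my-pipeline",
        PipelineDefinition=definition_json,
        RoleArn="arn:aws:iam::123456789012:role/foo",
    )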

    You can start/stop/view them in SageMaker Studio:

    Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines
    

    Is a SageMaker pipeline essentially CodePipeline? How do they integrate, and how do they differ?

    Not really. CodePipeline is more for building and deploying code in general, and is not specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SageMaker pipeline from CodePipeline.
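
    For example, a stage in CodePipeline (say a Lambda or CodeBuild action; that wiring is an assumption, not something CodePipeline does natively) could kick off an execution via the StartPipelineExecution API:

    import boto3

    sm = boto3.client("sagemaker")

    # Start an execution of an already-registered model building pipeline
    # ("foo" matches the pipeline name used in the SDK example below)
    response = sm.start_pipeline_execution(
        PipelineName="foo",
        PipelineExecutionDisplayName="triggered-from-codepipeline",  # illustrative
    )
    print(response["PipelineExecutionArn"])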

    There's also a Python SDK. How does this differ from the CDK and CloudFormation?

    The Python SDK is a stand-alone library for interacting with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string.

    A very simple example of Python SageMaker SDK usage:

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.workflow.pipeline import Pipeline
    from sagemaker.workflow.steps import ProcessingStep

    # The processor describes the container and instances that run the job
    processor = SKLearnProcessor(
        framework_version="0.23-1",
        instance_count=1,
        instance_type="ml.m5.large",
        role="role-arn",
    )

    # The step ties the processor to the script it should execute
    processing_step = ProcessingStep(
        name="processing",
        processor=processor,
        code="preprocessor.py",
    )

    # The pipeline is registered (created or updated) with SageMaker and started
    pipeline = Pipeline(name="foo", steps=[processing_step])
    pipeline.upsert(role_arn=...)
    pipeline.start()
    

    pipeline.definition() produces rather verbose JSON like this:

    {
        "Version": "2020-12-01",
        "Metadata": {},
        "Parameters": [],
        "PipelineExperimentConfig": {
            "ExperimentName": {
                "Get": "Execution.PipelineName"
            },
            "TrialName": {
                "Get": "Execution.PipelineExecutionId"
            }
        },
        "Steps": [
            {
                "Name": "processing",
                "Type": "Processing",
                "Arguments": {
                    "ProcessingResources": {
                        "ClusterConfig": {
                            "InstanceType": "ml.m5.large",
                            "InstanceCount": 1,
                            "VolumeSizeInGB": 30
                        }
                    },
                    "AppSpecification": {
                        "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
                        "ContainerEntrypoint": [
                            "python3",
                            "/opt/ml/processing/input/code/preprocessor.py"
                        ]
                    },
                    "RoleArn": "arn:aws:iam::123456789012:role/foo",
                    "ProcessingInputs": [
                        {
                            "InputName": "code",
                            "AppManaged": false,
                            "S3Input": {
                                "S3Uri": "s3://bucket/preprocessor.py",
                                "LocalPath": "/opt/ml/processing/input/code",
                                "S3DataType": "S3Prefix",
                                "S3InputMode": "File",
                                "S3DataDistributionType": "FullyReplicated",
                                "S3CompressionType": "None"
                            }
                        }
                    ]
                }
            }
        ]
    }
    

    You could use the above JSON with CloudFormation/CDK, but you would still build the JSON with the SageMaker SDK.
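
    For instance, with the CDK (Python, v2) you could feed that JSON string into the L1 AWS::SageMaker::Pipeline resource. A rough sketch, with placeholder names/ARN and the definition assumed to come from pipeline.definition():

    from aws_cdk import Stack
    from aws_cdk import aws_sagemaker as sagemaker
    from constructs import Construct

    class ModelPipelineStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, definition_json: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # L1 construct for AWS::SageMaker::Pipeline; the body is the JSON
            # string produced by pipeline.definition() above
            sagemaker.CfnPipeline(
                self,
                "ModelBuildingPipeline",
                pipeline_name="foo",
                role_arn="arn:aws:iam::123456789012:role/foo",
                pipeline_definition={"PipelineDefinitionBody": definition_json},
            )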

    You can also define model building workflows as Step Functions state machines (using the AWS Step Functions Data Science SDK) or with Apache Airflow.