amazon-sagemaker

Empty output job folder


I have set up a local JupyterLab instance with the SageMaker extension in order to run jobs in AWS. I can see that jobs are started with the proper .ipynb file in the input S3 folder, but I do not see any result: no files with the "output" prefix exist, and there is nothing to download.

Everything is configured following this manual. Roles and permissions are double-checked, and I have run out of ideas. Could you point me in a direction? Are there any relevant logs that would let me glance under the hood?

Update 1

Full scenario:

  1. Create IAM User (name: lab)
  2. Create inline policy for this user (name: LabPolicy)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EventBridgeSchedule",
            "Effect": "Allow",
            "Action": [
                "events:TagResource",
                "events:DeleteRule",
                "events:PutTargets",
                "events:DescribeRule",
                "events:EnableRule",
                "events:PutRule",
                "events:RemoveTargets",
                "events:DisableRule"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                }
            }
        },
        {
            "Sid": "IAMPassRoleToNotebookJob",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/SagemakerJupyterScheduler*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "events.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "IAMListRoles",
            "Effect": "Allow",
            "Action": "iam:ListRoles",
            "Resource": "*"
        },
        {
            "Sid": "S3ArtifactsAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutEncryptionConfiguration",
                "s3:CreateBucket",
                "s3:PutBucketVersioning",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetEncryptionConfiguration",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-automated-execution-*"
            ]
        },
        {
            "Sid": "S3DriverAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::sagemakerheadlessexecution-*"
            ]
        },
        {
            "Sid": "SagemakerJobs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:DescribePipeline",
                "sagemaker:CreateTrainingJob",
                "sagemaker:DeletePipeline",
                "sagemaker:CreatePipeline"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                }
            }
        },
        {
            "Sid": "AllowSearch",
            "Effect": "Allow",
            "Action": "sagemaker:Search",
            "Resource": "*"
        },
        {
            "Sid": "SagemakerTags",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListTags",
                "sagemaker:AddTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:space/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:user-profile/*"
            ]
        },
        {
            "Sid": "ECRImage",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
    ]
}
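Before attaching LabPolicy, it can help to sanity-check which statements are gated on the notebook-job scheduling tag; a minimal standard-library sketch (the helper and the trimmed-down policy literal are mine, for illustration only):

```python
import json

# Trimmed-down copy of two LabPolicy statements from above
POLICY = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "EventBridgeSchedule", "Effect": "Allow",
         "Action": ["events:PutRule"], "Resource": "*",
         "Condition": {"StringEquals":
             {"aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"}}},
        {"Sid": "IAMListRoles", "Effect": "Allow",
         "Action": "iam:ListRoles", "Resource": "*"}
    ]
}
""")

def tag_gated_sids(policy):
    """Sids of statements restricted to resources tagged as notebook-job schedules."""
    tag = "aws:ResourceTag/sagemaker:is-scheduling-notebook-job"
    return [s.get("Sid") for s in policy["Statement"]
            if s.get("Condition", {}).get("StringEquals", {}).get(tag) == "true"]

gated = tag_gated_sids(POLICY)
```

In the full policy, both EventBridgeSchedule and SagemakerJobs carry this condition, which is why resources created by the scheduler must be tagged correctly.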
  3. Create IAM Role (name: SagemakerJupyterSchedulerRole)
  4. Replace the existing trust policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "events.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
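Both services in the principal matter: sagemaker.amazonaws.com assumes the role to run the job itself, and events.amazonaws.com needs it for scheduled runs. A small sketch (the helper name is mine) to confirm who can assume the role:

```python
# Trust policy from the step above
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["sagemaker.amazonaws.com", "events.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

def assumable_by(policy):
    """Return the set of service principals allowed to assume the role."""
    services = set()
    for stmt in policy["Statement"]:
        if stmt.get("Effect") == "Allow" and stmt.get("Action") == "sts:AssumeRole":
            svc = stmt.get("Principal", {}).get("Service", [])
            services.update([svc] if isinstance(svc, str) else svc)
    return services

services = assumable_by(TRUST_POLICY)
```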
  5. Attach AmazonSageMakerFullAccess
  6. Create and attach the execution policy (name: SagemakerJupyterSchedulerExecutionPolicy)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListAllMyBuckets",
                "s3:GetBucketCors",
                "s3:PutBucketCors"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketAcl",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1"
            ]
        }
    ]
}
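A rough way to check that the execution policy's object-level statement covers the keys a job will read and write; fnmatch is only an approximation of IAM wildcard matching, and the account-id placeholder is kept as-is:

```python
from fnmatch import fnmatch

# Object-level resource pattern from the execution policy above
OBJECT_ARNS = ["arn:aws:s3:::sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/*"]

def covered(arn, patterns):
    """True if the ARN matches at least one resource pattern."""
    return any(fnmatch(arn, p) for p in patterns)

# An output artifact a completed job would write
output_arn = ("arn:aws:s3:::sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/"
              "myjob/output/hello-world2.ipynb")
is_covered = covered(output_arn, OBJECT_ARNS)
```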
  7. Job Options
Image: arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-base-python-38
Kernel: python3
Role ARN: arn:aws:iam::xxxxxxxxxxxx:role/SagemakerJupyterSchedulerRole
Input: s3://sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/
Output: s3://sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/
  8. Run!
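For reference, the artifact bucket name used in the job options follows the pattern visible above: sagemaker-automated-execution-&lt;account-id&gt;-&lt;region&gt;. A trivial helper (mine, not part of the extension) to build it:

```python
def automated_execution_bucket(account_id: str, region: str) -> str:
    """Build the default notebook-job artifact bucket name."""
    return f"sagemaker-automated-execution-{account_id}-{region}"

bucket = automated_execution_bucket("123456789012", "us-east-1")
```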

Eventually, I see only the input .ipynb files in the target S3 bucket:

❯ aws s3 ls s3://sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1/ --recursive
2024-04-18 16:57:19       4536 helloworld2ipynb-helloworld2-45e83409-2024-04-18-16-57-17/input/hello-world2.ipynb
2024-04-18 16:49:00       4536 helloworld2ipynb-helloworld2-4d179d7f-2024-04-18-16-48-57/input/hello-world2.ipynb
2024-04-15 11:31:10       4536 helloworld2ipynb-helloworld2-ae714726-2024-04-15-11-31-08/input/hello-world2.ipynb
2024-04-15 10:47:33       2631 helloworld2ipynb-helloworld2-b516481a-2024-04-15-10-47-31/input/hello-world2.ipynb
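The symptom is visible directly in the key layout: every key sits under .../input/ and none under .../output/. A quick way to split such a listing by stage (keys copied from the output above):

```python
def split_by_stage(keys):
    """Group job-folder keys by their stage segment (input/output)."""
    stages = {"input": [], "output": []}
    for key in keys:
        parts = key.split("/")  # <job-folder>/<stage>/<file>
        if len(parts) >= 3 and parts[1] in stages:
            stages[parts[1]].append(key)
    return stages

keys = [
    "helloworld2ipynb-helloworld2-45e83409-2024-04-18-16-57-17/input/hello-world2.ipynb",
    "helloworld2ipynb-helloworld2-4d179d7f-2024-04-18-16-48-57/input/hello-world2.ipynb",
]
stages = split_by_stage(keys)
```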

Update 2

@Tomonori Shimomura suggested an experiment, and I eventually realised that the job is not being run by SageMaker at all!

I should mention an earlier step here: in my previous question on StackOverflow I came up with my own "solution", and it now seems that my current issue is somehow related to that hack.

Update 3

I added the following to resource-metadata.json:

{
    "AppType": "JupyterLab"
}

... because I had found this in amazon_sagemaker_jupyter_scheduler/model_converter.py:

# With AppType "JupyterLab", the container runs the scheduler entrypoint;
# otherwise it falls back to a plain shell.
if get_app_type() == "JupyterLab":
    container_entrypoint = ["amazon_sagemaker_scheduler"]
else:
    container_entrypoint = ["/bin/bash"]
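Presumably get_app_type() derives the app type from the metadata file; a hypothetical stand-in (the real implementation may differ) that shows why the file's content matters:

```python
import json
import os
import tempfile

def get_app_type(path):
    """Hypothetical reimplementation: read AppType from resource-metadata.json."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f).get("AppType")

# Demonstrate with a temporary stand-in for /opt/ml/metadata/resource-metadata.json
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "resource-metadata.json")
    with open(path, "w") as f:
        json.dump({"AppType": "JupyterLab"}, f)
    app_type = get_app_type(path)
```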

Now the job starts but fails immediately with the error: "[FATAL tini (8)] exec amazon_sagemaker_scheduler failed: No such file or directory"

I am trying to figure it out, but don't have any ideas yet.

Final Update

I had missed the strict version requirements in the manual and was trying to use different versions from the 3rd branch. Fortunately, @Tomonori-san noticed my mistake.

Though, I need to mention that not everything started working as expected right after the downgrade, because I then faced a GetBucketLocation issue. Only after several hours of debugging the aiobotocore and amazon-sagemaker-jupyter-scheduler extensions did I finally realise that the S3 bucket has to be hosted somewhere other than us-east-1, because the LocationConstraint of us-east-1 buckets is null, which affects the logic in amazon_sagemaker_jupyter_scheduler/clients.py:

(base) lab4:~$ aws s3api get-bucket-location --bucket sagemaker-automated-execution-xxxxxxxxxxxx-us-east-2
{
    "LocationConstraint": "us-east-2"
}
(base) lab4:~$ aws s3api get-bucket-location --bucket sagemaker-automated-execution-xxxxxxxxxxxx-us-east-1
{
    "LocationConstraint": null
}
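This is the well-known get-bucket-location quirk: for buckets in us-east-1 the API returns a null LocationConstraint, so any code that uses the value verbatim as a region name breaks. A defensive resolver (the helper name is mine):

```python
def resolve_bucket_region(location_constraint):
    """Map a GetBucketLocation result to an actual region name."""
    if location_constraint is None:
        return "us-east-1"   # null means the bucket lives in us-east-1
    if location_constraint == "EU":
        return "eu-west-1"   # legacy alias returned for old eu-west-1 buckets
    return location_constraint

region = resolve_bucket_region(None)
```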

Solution

  • This issue is caused by the manually added resource-metadata.json.

    Please check the answer to your previous post for the solution that avoids the "[Errno 2] No such file or directory: '/opt/ml/metadata/resource-metadata.json'" error; after that, you can revert your hacks.