Tags: amazon-web-services, amazon-ec2, gpu, amazon-ecs, aws-batch

No GPU EC2 instances associated with AWS Batch


I need to set up GPU-backed instances on AWS Batch.

Here's my .yaml file:

  GPULargeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64:
            Fn::Sub: |
              MIME-Version: 1.0
              Content-Type: multipart/mixed; boundary="==BOUNDARY=="

              --==BOUNDARY==
              Content-Type: text/cloud-config; charset="us-ascii"

              runcmd:
                - yum install -y aws-cfn-bootstrap
                - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
                - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
                - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
                - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
                - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
                - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
                - /usr/bin/docker-storage-setup
                - yum update -y
                - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
                - /etc/init.d/docker restart

              --==BOUNDARY==--
      LaunchTemplateName: GPULargeLaunchTemplate

  GPULargeBatchComputeEnvironment:
    DependsOn:
      - ComputeRole
      - ComputeInstanceProfile
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        ImageId: ami-GPU-optimized-AMI-ID
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        LaunchTemplate:
          LaunchTemplateId:
            Ref: GPULargeLaunchTemplate
          Version:
            Fn::GetAtt:
              - GPULargeLaunchTemplate
              - LatestVersionNumber
        InstanceRole:
          Ref: ComputeInstanceProfile
        InstanceTypes:
          - g4dn.xlarge
        MaxvCpus: 768
        MinvCpus: 1
        SecurityGroupIds:
          - Fn::GetAtt:
              - ComputeSecurityGroup
              - GroupId
        Subnets:
          - Ref: ComputePrivateSubnetA
        Type: EC2
        UpdateToLatestImageVersion: True

  MyGPUBatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment:
            Ref: GPULargeBatchComputeEnvironment
          Order: 1
      Priority: 5
      JobQueueName: MyGPUBatchJobQueue
      State: ENABLED

  MyGPUJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Command:
          - "/opt/bin/python3"
          - "/opt/bin/start.py"
          - "--retry_count"
          - "Ref::batchRetryCount"
          - "--retry_limit"
          - "Ref::batchRetryLimit"
        Environment:
          - Name: "Region"
            Value: "us-west-2"
          - Name: "LANG"
            Value: "en_US.UTF-8"
        Image:
          Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
        JobRoleArn:
          Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
        Memory: 16000
        Vcpus: 1
        ResourceRequirements:
          - Type: GPU
            Value: '1'
      JobDefinitionName: MyGPUJobDefinition
      Timeout:
        AttemptDurationSeconds: 500

When I submit a job, it gets stuck in the RUNNABLE state forever. Here's what I tried:

  1. When I swapped the instance types to normal CPU instance types, redeployed the CloudFormation stack, and submitted a job, the job ran and succeeded fine, so something must be missing or wrong with the way I'm using these GPU instance types on AWS Batch;
  2. Then I found this post, so I added an ImageId field to my ComputeEnvironment with a known GPU-optimized AMI, but still no luck;
  3. I did a side-by-side comparison between a job on the working CPU compute environment and one on the non-working GPU compute environment by running aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2 (see the snippet after this list), and found that the containerInstanceArn and taskArn fields are simply missing from the non-working GPU job;
  4. I found that the GPU instance is in the ASG (Auto Scaling Group) created by the Compute Environment, but when I go to ECS and choose this GPU cluster, there are no container instances associated with it, unlike the working CPU setup, where the ECS cluster does have container instances registered.
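
  For reference, this is roughly the comparison I ran for step 3; the --query expression is just an addition here to pull out the relevant fields, and AWS_BATCH_JOB_EXECUTION_ID is a placeholder for the job ID returned by submit-job:

    # Compare these fields between a working CPU job and a stuck GPU job.
    aws batch describe-jobs \
        --jobs AWS_BATCH_JOB_EXECUTION_ID \
        --region us-west-2 \
        --query 'jobs[0].container.{containerInstanceArn: containerInstanceArn, taskArn: taskArn, reason: reason}'
    # On the working CPU job both ARNs are populated; on the stuck GPU job they
    # never appear, which means no container instance ever picked up the job.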

Any ideas how to fix this would be greatly appreciated!


Solution

  • This was definitely a great learning experience; here's what I did, what I found, and how I resolved the issue:

    1. It boils down to the fact that my newly launched GPU instance could not join the ECS cluster (both are created by the CloudFormation template above);
    2. Do the first-pass checks: VPC, subnets, and security groups, to see if anything is blocking the new GPU instance from joining the ECS cluster;
    3. Go through the troubleshooting steps here: https://repost.aws/knowledge-center/batch-job-stuck-runnable-status
    4. The link above points to the AWSSupport-TroubleshootAWSBatchJob runbook, which turned out to be helpful (make sure you choose the right region before running it);
    5. Connect to your GPU instance and install the ECS logs collector (a rough sketch of the commands is at the end of this answer): https://github.com/aws/amazon-ecs-logs-collector
    6. Check your logs; this is where I found the issue:
    2024-03-30T01:19:48Z msg="Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch"
    Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal systemd[1]: ecs.service: control process exited, code=exited status=255
    Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal kernel: NVRM: API mismatch: the client has the version 535.161.07, but
                                                                       NVRM: this kernel module has the version 470.182.03.  Please
                                                                       NVRM: make sure that this kernel module and all NVIDIA driver
                                                                       NVRM: components have the same version.
    
    7. So, somehow, my CDK did not pull in the latest GPU-optimized AMI (in theory it should, per the AWS docs), which caused the driver version mismatch. I went to https://github.com/aws/amazon-ecs-ami/releases to find the latest AMI ID, ami-019d947e77874eaee, added ImageId: ami-019d947e77874eaee to my template, and redeployed. Then you can use a few commands to check the status of your GPU EC2 instance (see the sketch at the end of this answer): systemctl status ecs should show the ECS agent up and running so that your GPU instance can join your ECS cluster, sudo docker info should return info showing that Docker is running, and nvidia-smi should show that your NVIDIA driver is properly installed and running. Example output:
    Sat Mar 30 13:47:46 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
    | N/A   20C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    
    8. Boom, if all of these work, your GPU-backed AWS Batch compute environment should happily take in scheduled jobs and run them! :)
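
    For step 5, this is roughly how I fetched and ran the logs collector on the instance; the download path below is taken from the amazon-ecs-logs-collector repo, so double-check its README in case the script has moved:

    # On the GPU instance (after connecting via SSH or SSM Session Manager):
    curl -O https://raw.githubusercontent.com/aws/amazon-ecs-logs-collector/master/ecs-logs-collector.sh
    sudo bash ./ecs-logs-collector.sh
    # The collected archive bundles the ECS agent and system logs where the
    # NVIDIA driver/library version mismatch above showed up.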
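
    For step 7, instead of hard-coding the AMI ID from the releases page, you can also resolve the current ECS GPU-optimized AMI from the public SSM parameter that AWS publishes (the path below assumes the Amazon Linux 2 GPU variant), and then run the same health checks on the instance itself:

    # Look up the current ECS GPU-optimized Amazon Linux 2 AMI for us-west-2
    # and use the returned value as the ImageId in the ComputeEnvironment.
    aws ssm get-parameters \
        --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
        --region us-west-2 \
        --query 'Parameters[0].Value' \
        --output text

    # On the GPU instance, verify everything it needs before it can register with the ECS cluster:
    systemctl status ecs     # the ECS agent should be active (running)
    sudo docker info         # the Docker daemon should respond without errors
    nvidia-smi               # the driver and kernel module versions should match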