I use a PyTorch Estimator with SageMaker to train/fine-tune my Graph Neural Net on multi-GPU machines. The requirements.txt that gets installed into the Estimator container has lines like:
torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
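For context, this is roughly how such a training job is launched; it is a minimal sketch, and the entry point, source directory, role ARN, and S3 path below are placeholders rather than values from my setup. SageMaker installs the requirements.txt found in source_dir into the container before running the entry point:

from sagemaker.pytorch import PyTorch

# Sketch of the Estimator configuration; all names/ARNs are placeholders.
estimator = PyTorch(
    entry_point="train.py",          # training script inside source_dir
    source_dir="src",                # contains train.py and requirements.txt
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.10",
    py_version="py38",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)

estimator.fit({"train": "s3://my-bucket/graph-data/"})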
When SageMaker installs these requirements in the Estimator on the endpoint, it takes ~2 hours to build the wheels. It takes only seconds on a local Linux box.
SageMaker Estimator: PyTorch v1.10, CUDA 11.x, Python 3.8, instance type ml.p3.16xlarge.
I have noticed the same issue with other wheel-based components that require CUDA.
I have also tried building a Docker container on a p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance GPUs.
Anything I can do to cut down these build times?
The solution is to augment the stock Estimator image with the required components; the resulting image can then be run in SageMaker script mode:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
The key is to make sure the nvidia runtime is used at build time, so Docker's daemon.json (/etc/docker/daemon.json) needs to be configured accordingly:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
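Once the image has been built on a GPU host with the nvidia runtime as the default and pushed to your own ECR repository, it can be passed to the Estimator through image_uri so that the pre-built wheels are already in place. A minimal sketch, assuming a repository and entry point of your own (the URI, names, and S3 path below are placeholders):

from sagemaker.pytorch import PyTorch

# Sketch: script mode on top of the custom image; URI and names are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-pyg-training:1.10-gpu-py38",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)

estimator.fit({"train": "s3://my-bucket/graph-data/"})

With image_uri supplied, the framework_version/py_version arguments are not needed, since the container already fixes those.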
This is still not a complete solution, because whether the resulting image works on SageMaker depends on the host where the build is performed.