Tags: amazon-sagemaker, python-wheel

Amazon SageMaker ScriptMode Long Python Wheel Build Times for CUDA Components


I use the SageMaker PyTorch estimator to train/fine-tune my Graph Neural Net on multi-GPU machines.

The requirements.txt that gets installed into the Estimator container has lines like:

torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html

When SageMaker installs these requirements in the Estimator on the endpoint, it takes ~2 hrs to build the wheels. The same installation takes only seconds on a local Linux box.

SageMaker Estimator:

• PyTorch v1.10
• CUDA 11.x
• Python 3.8
• Instance: ml.p3.16xlarge
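
For reference, the training job is launched through the SageMaker Python SDK roughly like this (a minimal sketch; the entry point, source directory, role ARN, and S3 path below are placeholders, not my actual values):

from sagemaker.pytorch import PyTorch

# Sketch of the estimator setup; entry_point, source_dir, role and the S3 path are placeholders
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",            # directory that contains requirements.txt
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    framework_version="1.10",
    py_version="py38",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)

estimator.fit({"training": "s3://my-bucket/graph-data/"})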

I have noticed the same issue with other wheel-based components that require CUDA.

I have also tried building a Docker container on a p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance GPUs.

Anything I can do to cut down these build times?


Solution

  • The solution is to augment the stock estimator image with the required components; the resulting image can then be run in SageMaker script mode:

    # Start from the stock SageMaker PyTorch training image
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38

    # Bake the CUDA-dependent requirements into the image so they are not built at job start
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install -r /tmp/requirements.txt
    

    The key is to make sure the NVIDIA runtime is used at build time, so /etc/docker/daemon.json needs to be configured accordingly:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    

    This is still not a complete solution, because whether the resulting image works on SageMaker depends on the host where the build is performed.
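
    Once the image is built with the NVIDIA runtime and pushed to ECR, it can be used in script mode by passing image_uri to the estimator. A minimal sketch (the image URI, entry point, and role ARN are placeholders):

    from sagemaker.pytorch import PyTorch

    # Sketch: reuse the augmented image; the URI, entry_point, source_dir and role are placeholders
    estimator = PyTorch(
        entry_point="train.py",
        source_dir="src",
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/pytorch-gnn-training:1.10-gpu-py38",
        instance_type="ml.p3.16xlarge",
        instance_count=1,
    )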