I use a PyTorch Estimator with SageMaker to train/fine-tune my Graph Neural Net on multi-GPU machines. The requirements.txt that gets installed into the Estimator container has lines like:
torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
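For context, this is roughly how such a training job is launched; it is a minimal sketch, and the entry point, source directory, role ARN, and S3 path below are placeholders rather than values from my setup. SageMaker installs the requirements.txt found in source_dir into the container before running the entry point:

from sagemaker.pytorch import PyTorch

# Sketch of the Estimator configuration; all names/ARNs are placeholders.
estimator = PyTorch(
    entry_point="train.py",          # training script inside source_dir
    source_dir="src",                # contains train.py and requirements.txt
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.10",
    py_version="py38",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)

estimator.fit({"train": "s3://my-bucket/graph-data/"})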
When SageMaker installs these requirements in the Estimator on the endpoint, it takes ~2 hours to build the wheels. It takes only seconds on a local Linux box.
SageMaker Estimator: PyTorch v1.10, CUDA 11.x, Python 3.8, instance type ml.p3.16xlarge.
I have noticed the same issue with other wheel-based components that require CUDA.
I have also tried building a Docker container on a p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance GPUs.
Anything I can do to cut down these build times?
The solution is to augment the stock Estimator image with the required components; the resulting image can then be run in SageMaker script mode:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
The key is to make sure the nvidia runtime is used at build time, so Docker's daemon.json (/etc/docker/daemon.json) needs to be configured accordingly:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
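Once the image has been built on a GPU host with the nvidia runtime as the default and pushed to your own ECR repository, it can be passed to the Estimator through image_uri so that the pre-built wheels are already in place. A minimal sketch, assuming a repository and entry point of your own (the URI, names, and S3 path below are placeholders):

from sagemaker.pytorch import PyTorch

# Sketch: script mode on top of the custom image; URI and names are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-pyg-training:1.10-gpu-py38",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)

estimator.fit({"train": "s3://my-bucket/graph-data/"})

With image_uri supplied, the framework_version/py_version arguments are not needed, since the container already fixes those.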
This is still not a complete solution, because whether the resulting image works on SageMaker depends on the host where the build is performed.