Tags: python, pytorch, azure-machine-learning-service

Azure ML experiment using custom GPU CUDA environment


During the last week I have been trying to create a Python experiment in Azure ML Studio. The job consists of training a PyTorch (1.12.1) neural network using a custom environment with CUDA 11.6 for GPU acceleration. However, when attempting any operation that moves a tensor to the GPU, I get a RuntimeError:

device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device = "cpu")
test_tensor.to(device)
which fails with:

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I have tried to set CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
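
For completeness, a minimal sketch of how the variable can be set from inside the training script itself (before the first CUDA call), assuming it is not set through the job configuration:

import os

# Commonly set before importing torch so it applies to every CUDA call
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch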

I have also tried to check if CUDA is available:

print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")

and the result is completely normal:

Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80

I also tried downgrading and changing the CUDA, PyTorch, and Python versions, but this does not seem to affect the error.
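
A quick way to confirm which versions actually end up inside the job (assuming one checks this from the training script) is:

import sys
import torch

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA (as built into PyTorch): {torch.version.cuda}")
print(f"cuDNN: {torch.backends.cudnn.version()}")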

As far as I can tell, this error appears only when using a custom environment. When a curated environment is used, the script runs with no problem. However, since the script needs some libraries such as OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:

FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1


USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y 

RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
                'azureml-core' \
                'azureml-dataset-runtime' \
                'azureml-defaults' \
                'azure-ml' \
                'azure-ml-component' \
                'azureml-mlflow' \
                'azureml-telemetry' \
                'azureml-contrib-services'

COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888

The COPY statement is taken from one of the curated environments already predefined by Azure. I would like to highlight that I also tried using the Dockerfile provided by one of these curated environments, without any modification, and I got the same result.
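
For reference, a minimal sketch of how such a Dockerfile can be turned into an Azure ML environment and attached to a GPU run with the v1 azureml-core SDK (the environment name, cluster name, and script name below are illustrative placeholders, not the exact ones I use):

from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()

# Build a custom environment from the Dockerfile shown above
env = Environment.from_dockerfile(name="custom-cuda116-env", dockerfile="./Dockerfile")

# "gpu-cluster" is a placeholder for an NC-series (Tesla K80) compute target
config = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="gpu-cluster",
    environment=env,
)

run = Experiment(ws, "pytorch-cuda-test").submit(config)
run.wait_for_completion(show_output=True)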

Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?

I have tried to find a solution for this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that some of you can help me out here.


Solution

  • The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the custom Docker container itself or its dependencies.

    Since you have a Tesla K80, I suspect you are running on NC-series machines (the VMs on which the environments are deployed).

    As of writing this answer (10 February 2023), the following note applies (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):

    Note

    Currently, due to underlying cuda and cluster incompatibilities, on NC series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3 can be used.

    Therefore, in my opinion, this can be traced back to the supported combinations of CUDA, PyTorch, and Python versions.

    What I did in my case was simply install my dependencies via a .yaml dependency file when creating the environment, starting from this base image (a minimal sketch of this approach follows at the end of this answer):

    Azure container registry

    mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
    

    You can build your Docker container using this URI as the base image so that it works properly on Tesla K80s.

    IMPORTANT NOTE: Using this base image did work in my case; I was able to train PyTorch models.
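
    For illustration, a minimal sketch of the .yaml-based approach with the v1 azureml-core SDK (the environment name and the environment.yml file name are placeholders; the conda file is assumed to list the extra pip packages the job needs, such as opencv-python and ultralytics):

    from azureml.core import Environment, Workspace

    ws = Workspace.from_config()

    # Dependencies come from a conda .yaml specification file;
    # "environment.yml" is a placeholder name for that file.
    env = Environment.from_conda_specification(
        name="acpt-pytorch-111-cuda113-custom",
        file_path="environment.yml",
    )

    # Install those dependencies on top of the curated ACPT base image,
    # which is the combination that worked on the Tesla K80 / NC-series nodes.
    env.docker.base_image = (
        "mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9"
    )

    env.register(workspace=ws)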