Tags: docker, pytorch, dockerfile

Inexplicable behavior when using docker container with PyTorch and CUDA


So, I want to make a Docker image in which I train neural network models using PyTorch with a GPU. The plan is to run this image on Vertex AI.

I am using nvcr.io/nvidia/pytorch:23.07-py3 as the base image, since it already has Python 3.10 and PyTorch installed, and I can pip install the rest of my dependencies.

The Dockerfile looks like this:

FROM nvcr.io/nvidia/pytorch:23.07-py3
# nvcr.io/nvidia/pytorch:23.07-py3 has python 3.10.6 and pytorch 2.1 pre-installed

# Specify working directory for next commands
WORKDIR /root

# copy my code to /root
...

# install pip dependencies (pytorch and cuda support should already be installed)
RUN pip install -r pip_requirements.txt

# THIS DOESN'T WORK FOR SOME REASON: 
# ENTRYPOINT ["python", "train_pytorch_model.py"]

# This works but I can't pass command-line args to it because "$@" is blank 
#ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py $@"]

When I try to use ENTRYPOINT ["python", "train_pytorch_model.py"] I am met with the error:

UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)

However, when I use ENTRYPOINT ["/bin/bash"], run the Docker container, and execute python -c "import torch; print(torch.cuda.is_available())" inside it, I get True. Running the training script this way also works, so I thought I could use:

ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py"].

And this would work, were it not for the problem that the command-line arguments Vertex AI passes to the container need to reach train_pytorch_model.py as its own command-line arguments. I tried:

ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py $@"]

but $@ (and $1) appears to be blank.
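
(Note: with bash -c, the first argument after the command string is bound to $0, not $1, so the first container argument gets swallowed, which would explain the blank $@ and $1. A placeholder in the $0 slot should make the forwarding work; a sketch I have not verified on Vertex AI:

ENTRYPOINT ["/bin/bash", "-c", "exec python train_pytorch_model.py \"$@\"", "--"]

Here "--" only fills $0, so whatever Vertex AI appends lands in $1 onward and "$@" expands to all of it.)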

What I am asking for is either:

  1. A way to make the ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py"] workaround work with the command-line arguments, or
  2. A way to fix the original ENTRYPOINT, which passes the command-line arguments successfully but currently has that CUDA problem.

How can the same python executable behave differently in these scenarios? I must be missing something...
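
One quick way to compare what each entrypoint actually sees is to dump the environment from both (my-image stands in for the built image):

# environment seen by the direct python entrypoint
docker run --rm --entrypoint python my-image -c 'import os; print(os.environ.get("LD_LIBRARY_PATH"))'

# environment seen when python is launched through bash
docker run --rm --entrypoint /bin/bash my-image -c 'python -c "import os; print(os.environ.get(\"LD_LIBRARY_PATH\"))"'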

Thank you in advance.

Edit: Here is the nvidia-smi output as additional info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    30W /  70W |      0MiB / 15360MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Edit #2: By printing the environment variables in both cases I noticed the following differences:

# bash version
LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

# python version
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

and

# bash version
PATH=/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin

# python version
PATH=/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin

Could this be why? (Notably, the python version's LD_LIBRARY_PATH is missing /usr/local/cuda/compat/lib.real.) How do I fix it?


Solution

  • So I forced the correct values in the Dockerfile by adding:

    ENV LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
    ENV PATH=/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
    

    and now the python entrypoint finally works like the bash way did.
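
  • As for why the two entrypoints see different environments: the NVIDIA NGC PyTorch images do part of their setup in shell startup/entrypoint scripts, which appears to be where /usr/local/cuda/compat/lib.real (the forward-compatibility copy of the CUDA driver libraries) gets prepended to LD_LIBRARY_PATH, so a python entrypoint that never goes through bash misses it. Hard-coding the values works, but they are tied to this exact image tag. A less brittle sketch, assuming bash applies the same environment setup to scripts as it did to bash -c, is a wrapper that execs python and forwards the container arguments:

    #!/bin/bash
    # entrypoint.sh (hypothetical wrapper): "$@" receives whatever arguments
    # Vertex AI appends, and exec replaces the shell with the python process
    exec python train_pytorch_model.py "$@"

    plus, in the Dockerfile:

    COPY entrypoint.sh /root/entrypoint.sh
    ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]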