So, I want to make a Docker image where I train neural network models using PyTorch with a GPU. The plan is to run this image on Vertex AI.
I am using nvcr.io/nvidia/pytorch:23.07-py3 as the base image, since it should already have Python 3.10 and PyTorch installed, and I can pip install the rest of my dependencies.
The Dockerfile looks like this:
FROM nvcr.io/nvidia/pytorch:23.07-py3
# nvcr.io/nvidia/pytorch:23.07-py3 has python 3.10.6 and pytorch 2.1 pre-installed
# Specify working directory for next commands
WORKDIR /root
# copy my code to /root
...
# install pip dependencies (pytorch and cuda support should already be installed)
RUN pip install -r pip_requirements.txt
# THIS DOESN'T WORK FOR SOME REASON:
# ENTRYPOINT ["python", "train_pytorch_model.py"]
# This works, but I can't pass command-line args to it because "$@" is blank:
# ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py $@"]
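For context, with the exec-form ENTRYPOINT ["python", "train_pytorch_model.py"], anything passed as arguments when the container starts is appended to that list and ends up in sys.argv, which is exactly what I want. A quick local illustration (my-training-image and the flags are just placeholders, not my real setup):
# hypothetical local test of the exec-form entrypoint
docker run --gpus all my-training-image --epochs 10 --lr 0.001
# inside the container this runs:
#   python train_pytorch_model.py --epochs 10 --lr 0.001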
When I try to use ENTRYPOINT ["python", "train_pytorch_model.py"]
I am met with the error:
UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
However, when I use ENTRYPOINT ["/bin/bash"], run the Docker container, and execute
python -c "import torch; print(torch.cuda.is_available())"
inside it, I get True. Running the training script this way appears to work, so I thought I could use:
ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py"]
And this would work, were it not for the problem that I need the command-line arguments passed to Vertex AI to reach train_pytorch_model.py as command-line arguments. I tried:
ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py $@"]
but $@ (and $1) appears to be blank.
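My best guess so far is that bash -c assigns the first argument after the command string to $0 rather than $1, so whatever Vertex AI appends first gets swallowed and never reaches "$@". Something along these lines might therefore work, though I have not verified it on Vertex AI yet (the trailing "--" is just a dummy $0):
# The "--" is consumed as $0, so the arguments Vertex AI appends become $1, $2, ...
# and the quoted "$@" expands to them; exec lets python replace the shell process.
ENTRYPOINT ["/bin/bash", "-c", "exec python train_pytorch_model.py \"$@\"", "--"]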
What I am asking for is either:
1. how to make the ENTRYPOINT ["/bin/bash", "-c", "python train_pytorch_model.py"] workaround work with the command-line arguments, or
2. an explanation of how the same python executable can behave differently in these two scenarios. I must be missing something...
Thank you in advance.
Edit: here is the nvidia-smi output as additional info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 50C P0 30W / 70W | 0MiB / 15360MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Edit #2: By printing the environment variables in both cases I realized the following differences:
# bash version
LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
# python version
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
and
# bash version
PATH=/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
# python version
PATH=/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
Could this be why? The bash version has /usr/local/cuda/compat/lib.real (which looks like the CUDA forward-compatibility libraries) at the front of LD_LIBRARY_PATH, which would explain why PyTorch complains that the driver is too old when that entry is missing. How do I fix it?
So I forced the correct values in the Dockerfile by adding:
ENV LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV PATH=/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
and now the python entrypoint finally works the same way the bash one did.
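Hard-coding these paths feels brittle, though, since they will change whenever the base image does. If I am reading the base image right, the NGC containers ship their own entrypoint script that prepares this environment before exec-ing whatever command follows it, so chaining it might be a cleaner alternative; I have not tested this, and the path below is just what the image appears to use (check with docker inspect):
# assumed entrypoint script from the NGC base image; verify the path with docker inspect
ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh", "python", "train_pytorch_model.py"]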