I'm trying to launch a training job on Google AI Platform with a custom container. Since I want to train on GPUs, the base image I used for my container is:
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04
With this image (and tensorflow 2.4.1 installed on top of it) I thought I could use the GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following:
W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
Is this a good way to build an image to use GPUs on Google AI Platform? Or should I instead start from a tensorflow image and manually install all the drivers needed to use the GPUs?
EDIT: I read here (https://cloud.google.com/ai-platform/training/docs/containers-overview) the following:
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.

Install your training application, along with your required ML framework and other dependencies in your Docker image.
They also provide an example Dockerfile for training with GPUs, so what I did seems fine. Unfortunately, I still get the errors mentioned above, which may (or may not) explain why I cannot use GPUs on Google AI Platform.
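For reference, a minimal sketch of the documented pattern (the image tag, package versions, and the `trainer.task` module path are illustrative assumptions, not taken from the docs):

```dockerfile
# Base image with matching CUDA toolkit and cuDNN pre-installed
FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu18.04

# Install Python and the ML framework on top (versions are illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install tensorflow==2.4.1

# Copy the training application into the image
WORKDIR /app
COPY trainer/ /app/trainer/

# AI Platform runs the container's entrypoint as the training job
ENTRYPOINT ["python3", "-m", "trainer.task"]
```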
EDIT2: Following the instructions here (https://www.tensorflow.org/install/gpu), my Dockerfile is now:
FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
lsb-release \
vim \
curl \
git \
libgl1-mesa-dev \
software-properties-common \
wget && \
rm -rf /var/lib/apt/lists/*
# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update
RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get install -y ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update
# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi
RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get install -y ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update
# Install development and runtime libraries (~4GB)
RUN apt-get install -y --no-install-recommends \
cuda-11-0 \
libcudnn8=8.0.4.30-1+cuda11.0 \
libcudnn8-dev=8.0.4.30-1+cuda11.0
# other stuff
The problem is that the build freezes at what seems to be an interactive keyboard-configuration step: the system asks me to select a country, and when I enter the number, nothing happens.
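This kind of freeze is typical of debconf waiting for interactive input during apt-get install. A common workaround (my own assumption, not something from the AI Platform docs) is to make the debconf frontend noninteractive for the duration of the build:

```dockerfile
# Prevent debconf from prompting interactively (e.g. keyboard-configuration,
# tzdata) during the image build. ARG keeps the setting build-time only;
# use ENV instead if you want it to persist in the running container.
ARG DEBIAN_FRONTEND=noninteractive

# With the ARG above, installs that would otherwise stop at a prompt
# pick the default answers and continue.
RUN apt-get update && apt-get install -y --no-install-recommends \
        keyboard-configuration && \
    rm -rf /var/lib/apt/lists/*
```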
The most reliable way to build your container is to use the officially maintained Deep Learning Containers. I would suggest pulling 'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'. It already has CUDA, cuDNN, the GPU drivers, and TF 2.4 installed and tested; you just need to add your code on top.
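A minimal sketch of that approach (the `requirements.txt` file and the `trainer.task` module path are hypothetical placeholders for your own code):

```dockerfile
# Officially maintained image with TF 2.4, CUDA, cuDNN and the GPU
# runtime libraries already installed and tested together
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Add your training code and extra dependencies on top
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY trainer/ ./trainer/

# Entry point AI Platform will run as the training job
ENTRYPOINT ["python", "-m", "trainer.task"]
```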