Tags: python, docker, tensorflow, gpu

TensorFlow Docker Not Using GPU


I'm trying to get TensorFlow working with a GPU on my Ubuntu 24.04.1 machine.

According to this page:

Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver

So I'm trying to use Docker.

To check that my GPU works with Docker, I ran:

docker run --gpus all --rm nvidia/cuda:12.6.2-cudnn-runtime-ubuntu24.04 nvidia-smi

The output is:

==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Sat Oct 26 01:16:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN RTX               Off |   00000000:01:00.0 Off |                  N/A |
| 41%   40C    P8             24W /  280W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

(Side note: I'm not using the command they suggest, docker run --gpus all --rm nvidia/cuda nvidia-smi, because it fails now that nvidia/cuda no longer has a latest tag.)

So the GPU appears to be working with Docker. However, when I run:

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output is:

2024-10-26 01:20:51.021242: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1729905651.033544       1 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1729905651.037491       1 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-26 01:20:51.050486: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1729905652.350499       1 gpu_device.cc:2344] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

This indicates that TensorFlow does not detect any GPU.
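
The warning refers to "the missing libraries mentioned above", but the log does not actually name any. A probe along the following lines, run inside the container, could help narrow down which shared libraries the dynamic loader cannot find. This is only a sketch: probe_libraries is a hypothetical helper, and the library file names are assumptions for a CUDA 12 / cuDNN 9 stack, not something taken from the TensorFlow image itself.

```python
import ctypes

# Assumed sonames for a CUDA 12 / cuDNN 9 stack. Adjust them to the
# versions your TensorFlow build expects.
CANDIDATE_LIBS = [
    "libcuda.so.1",       # driver library, injected by the NVIDIA container runtime
    "libcudart.so.12",
    "libcublas.so.12",
    "libcufft.so.11",
    "libcusparse.so.12",
    "libcudnn.so.9",
]


def probe_libraries(names):
    """Return {name: bool}, where True means the loader could open the library."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library (or one of its dependencies) cannot be loaded.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in probe_libraries(CANDIDATE_LIBS).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A MISSING result only means the library is not on the default loader search path. Libraries installed through pip can live under site-packages and still be found by TensorFlow, so treat the output as a hint rather than a verdict.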

What am I doing wrong here?


Solution

  • I don't think you're doing anything wrong. I suspect the image is a "pip install" short of being complete.

    I'm running a different flavor of Linux, but to start with I had to make sure my GPU was available to Docker (see "Add nvidia runtime to docker runtimes"), and I upgraded CUDA to the latest version.

    Even after doing all this, I got the same error as you.

    So I opened a shell in the container as follows: docker run -it --rm --runtime=nvidia --gpus all tensorflow/tensorflow:latest-gpu /bin/bash

    and ran pip install tensorflow[and-cuda]

    Some of the dependencies were already there, and others had to be installed because they were missing. This is undesirable, because you'd expect the image to ship with everything it needs to run (maybe they'll fix the image in the future).

    After it finished, I ran python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" and it finally found my GPU.

    You'll want to create your own Docker image using theirs as a base. Your Dockerfile might look something like this:

    # Use the official TensorFlow GPU base image
    FROM tensorflow/tensorflow:latest-gpu
    
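    # Install TensorFlow together with its CUDA dependencies
    # (quoted so the shell cannot treat the brackets as a glob pattern)
    RUN pip install "tensorflow[and-cuda]"
    
    # Start an interactive shell by default
    CMD ["bash"]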
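
Once the image is built, a slightly fuller check than the one-liner above can confirm that TensorFlow not only lists the GPU but also places work on it. The following is a minimal sketch, not part of the original answer: gpu_report is a hypothetical helper name, and it assumes it is run inside the container with --gpus all.

```python
def gpu_report():
    """Collect basic GPU facts about the TensorFlow install, if there is one."""
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow_installed": False, "gpus": [], "matmul_device": None}

    gpus = tf.config.list_physical_devices("GPU")
    report = {
        "tensorflow_installed": True,
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [gpu.name for gpu in gpus],
        "matmul_device": None,
    }
    if gpus:
        # Run a small matrix multiplication and record where it was placed.
        a = tf.random.uniform((256, 256))
        product = tf.matmul(a, a)
        report["matmul_device"] = product.device
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If gpus comes back empty while built_with_cuda is true, the build supports CUDA but the runtime libraries or the device are not visible, which matches the symptom in the question.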