Search code examples
pythontensorflowgpu

Cuda 12 + tf-nightly 2.12: Could not find cuda drivers on your machine, GPU will not be used, while every checking is fine and in torch it works


  • tf-nightly version = 2.12.0-dev2023203
  • Python version = 3.10.6
  • CUDA drivers version = 525.85.12
  • CUDA version = 12.0
  • Cudnn version = 8.5.0
  • I am using Linux (x86_64, Ubuntu 22.04)
  • I am coding in Visual Studio Code on a venv virtual environment

I am trying to run some models on the GPU (NVIDIA GeForce RTX 3050) using tensorflow nightly 2.12 (to be able to use Cuda 12.0). The problem that I have is that apparently every checking that I am making seems to be correct, but in the end the script is not able to detect the GPU. I've dedicated a lot of time trying to see what is happening and nothing seems to work, so any advice or solution will be more than welcomed. The GPU seems to be working for torch as you can see at the very end of the question.

I will show some of the most common checkings regarding CUDA that I did (Visual Studio Code terminal), I hope you find them useful:

  1. Check CUDA version:

    $ nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0
    
  2. Check if the connection with the CUDA libraries is correct:

    $ echo $LD_LIBRARY_PATH

    /usr/cuda/lib
    
  3. Check nvidia drivers for the GPU and check if GPU is readable for the venv:

    $ nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
    | N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
    +-----------------------------------------------------------------------------+
    
  4. Add cuda/bin PATH and Check it:

    $ export PATH="/usr/local/cuda/bin:$PATH"

    $ echo $PATH

    /usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
    
  5. Custom function to check if CUDA is correctly installed: [function by Sherlock]

    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcuda
    check libcudart
    
    libcudart.so.12 -> libcudart.so.12.0.146
            libcuda.so.1 -> libcuda.so.525.85.12
            libcuda.so.1 -> libcuda.so.525.85.12
            libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
    libcuda is installed
            libcudart.so.12 -> libcudart.so.12.0.146
    libcudart is installed
    
  6. Custom function to check if Cudnn is correctly installed: [function by Sherlock]

    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcudnn 
    
            libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.8.0
            libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.8.0
            libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.8.0
            libcudnn.so.8 -> libcudnn.so.8.8.0
            libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.8.0
            libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.8.0
            libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.8.0
    libcudnn is installed
    

So, once I did these previous checkings I used a script to evaluate if everything was finally ok and then the following error appeared:

import tensorflow as tf

print(f'\nTensorflow version = {tf.__version__}\n')
print(f'\n{tf.config.list_physical_devices("GPU")}\n')
2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]

Extra check: I tried to run a checking script on torch and in here it worked so I guess the problem is related with tensorflow/tf-nightly

import torch

print(f'\nAvailable cuda = {torch.cuda.is_available()}')

print(f'\nGPUs availables = {torch.cuda.device_count()}')

print(f'\nCurrent device = {torch.cuda.current_device()}')

print(f'\nCurrent Device location = {torch.cuda.device(0)}')

print(f'\nName of the device = {torch.cuda.get_device_name(0)}')
Available cuda = True

GPUs availables = 1

Current device = 0

Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>

Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU

Please, if you know something that might help solve this issue, don't hesitate on telling me.


Solution

  • I think that, as of March 2023, the only tensorflow distribution for cuda 12 is the docker package from NVIDIA.

    A tf package for cuda 12 should show the following info

    >>> tf.sysconfig.get_build_info() 
    OrderedDict([('cpu_compiler', '/usr/bin/x86_64-linux-gnu-gcc-11'), 
    ('cuda_compute_capabilities', ['compute_86']), 
    ('cuda_version', '12.0'), ('cudnn_version', '8'), 
    ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])
    

    But if we run tf.sysconfig.get_build_info() on any tensorflow package installed via pip, it stills tells that cuda_version is 11.x

    So your alternatives are:

    • install docker with the nvidia cloud instructions and run one of the recent containers
    • compile tensorflow from source, either nightly or last release. Caveat, it takes a lot of RAM and some time, as all good compilations do, and the occasional error to be corrected on the run. In my case, to define kFP8, the new 8-bits float.
    • wait