How can I resolve TensorFlow warnings - cuDNN, cuFFT, cuBLAS and NUMA?


I'm trying to set up my Ubuntu 22.04 machine (which has an NVIDIA GeForce GTX 1050 Ti Mobile) to run a TensorFlow project for my Master's thesis.

I've successfully installed the NVIDIA driver 535 and the NVIDIA CUDA Toolkit 12.2, using the instructions below.

sudo apt install nvidia-driver-535
wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run
sudo sh cuda_12.2.2_535.104.05_linux.run

To ensure that everything was properly installed, I checked that my CUDA files were located at /usr/local/cuda-12.2, with the symbolic link /usr/local/cuda pointing to it, and ran both nvidia-smi and nvcc --version, which output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8              N/A /  30W |      7MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1863      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
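
The same check can be scripted. The sketch below is a minimal, hypothetical helper (the function names are not from any library) that pulls the toolkit release out of nvcc --version output and the CUDA Version field out of nvidia-smi output, then compares them. It assumes the usual rule of thumb that the CUDA Version shown by nvidia-smi is the newest CUDA release the installed driver supports, so the toolkit release should not exceed it.

```python
import re
from typing import Optional, Tuple


def nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the toolkit release (e.g. '12.2') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def driver_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version' field (e.g. '12.2') from `nvidia-smi` output."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def _as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '12.2' into (12, 2) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def toolkit_supported(nvcc_output: str, smi_output: str) -> bool:
    """Return True if the toolkit release is not newer than the driver's CUDA version."""
    toolkit = nvcc_release(nvcc_output)
    driver = driver_cuda_version(smi_output)
    if toolkit is None or driver is None:
        return False  # one of the outputs could not be parsed
    return _as_tuple(toolkit) <= _as_tuple(driver)
```

To use it, capture the text of both commands (for example with subprocess.run(..., capture_output=True, text=True)) and pass the two strings to toolkit_supported.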

Finally, I created a Python 3.9.2 venv, installed TensorFlow 2.17 (using pip install tensorflow[and-cuda]) and ran the script below:

import tensorflow as tf

def check_gpu():
    print("TensorFlow Version:", tf.__version__)

    # List all devices
    devices = tf.config.list_physical_devices()
    print("Physical devices:")
    for device in devices:
        print(device)

    # Check if GPU is available
    if tf.config.list_physical_devices('GPU'):
        print("GPU is available")
    else:
        print("GPU is not available")

    # Test TensorFlow GPU operation
    try:
        with tf.device('/GPU:0'):  # Specify GPU device
            a = tf.constant([1.0, 2.0, 3.0, 4.0])
            b = tf.constant([2.0, 2.0, 2.0, 2.0])
            c = a + b
            print("TensorFlow can run on GPU")
            print("Result of GPU operation:", c.numpy())
    except RuntimeError as e:
        print("Error using GPU:", e)

check_gpu()
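
One caveat with a test like this: requesting /GPU:0 does not by itself prove that the operation ran there. If soft device placement is enabled (see tf.config.set_soft_device_placement), TensorFlow may fall back to the CPU without raising an error. A minimal sketch of a stricter check, assuming eager execution, reads the device string from the result tensor. The helper names are illustrative only and are not part of TensorFlow.

```python
def is_gpu_device(device_name: str) -> bool:
    """Return True if a TensorFlow device string refers to a GPU."""
    return ":GPU:" in device_name.upper()


def add_on_gpu():
    """Run a small addition under tf.device('/GPU:0') and report where the result lives."""
    import tensorflow as tf  # imported lazily so is_gpu_device works without TensorFlow

    with tf.device("/GPU:0"):
        c = tf.constant([1.0, 2.0, 3.0, 4.0]) + tf.constant([2.0, 2.0, 2.0, 2.0])
    # In eager mode, c.device is the full device name associated with the result tensor.
    return c.device, is_gpu_device(c.device)
```

Calling add_on_gpu() in the venv returns the device string together with a boolean, which is a slightly stronger signal than the absence of a RuntimeError.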

Even though the script executed successfully and the operations were performed by the GPU, I get multiple warnings (cuDNN, cuFFT, cuBLAS and NUMA), as shown below.

2024-08-20 12:24:28.116134: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-20 12:24:28.129784: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-20 12:24:28.133871: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

I0000 00:00:1724153069.716493   10819 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

After browsing the internet, I tried to manually add the cuDNN files to my CUDA installation by running the commands below, but the warnings persisted.

wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.3.0.75_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-9.3.0.75_cuda12-archive.tar.xz
cd cudnn-linux-x86_64-9.3.0.75_cuda12-archive
sudo cp include/cudnn*.h /usr/local/cuda/include
sudo cp lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

If anyone has any idea how to resolve these warnings, I would very much appreciate it.

Thanks in advance!


Solution

  • After some headaches, I was finally able to solve all the warnings!

    Regarding the cuDNN, cuFFT and cuBLAS warnings:

    Basically, I downgraded the NVIDIA driver from 535 to 470, CUDA from 12.2 to 11.4, cuDNN from 9.3.0 to 8.9.7 and TensorFlow from 2.17 to 2.5. Even though this is not the ideal solution, it is still better than before, as it allows me to do the work I need. Also, note that these are not the versions NVIDIA recommends for my system, but they are the ones that worked for me. Here are the steps I took:

    1. I removed every package and driver related to NVIDIA and CUDA
    sudo apt remove --purge '*nvidia*' '*cuda*'
    sudo apt autoremove
    
    2. Then, I installed NVIDIA driver version 470
    sudo apt install nvidia-driver-470
    sudo reboot
    
    3. Next, I installed CUDA 11.4 using the runfile for Ubuntu 20.04, which works fine on Ubuntu 22.04. Note that, during installation, a warning will pop up about a driver already being installed. Just type continue, accept the NVIDIA terms and uncheck the driver option in the selection.
    wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
    sudo sh cuda_11.4.4_470.82.01_linux.run
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    sudo reboot
    
    4. Finally, I manually added the cuDNN 8.9.7 files to the CUDA installation directory
    wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.7.29_cuda11-archive.tar.xz
    tar xf cudnn-linux-x86_64-8.9.7.29_cuda11-archive.tar.xz
    sudo cp cudnn-linux-x86_64-8.9.7.29_cuda11-archive/include/cudnn*.h /usr/local/cuda/include
    sudo cp cudnn-linux-x86_64-8.9.7.29_cuda11-archive/lib/libcudnn* /usr/local/cuda/lib64
    sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
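
When juggling versions like this, it can help to see which CUDA and cuDNN versions the installed TensorFlow wheel reports it was built against. The following is a minimal sketch, not part of the steps above: summarize_build and tensorflow_build_summary are hypothetical helpers, and the sketch assumes a GPU build of TensorFlow 2.3 or later, where tf.sysconfig.get_build_info() returns a dictionary that includes cuda_version and cudnn_version.

```python
def summarize_build(build_info: dict) -> str:
    """Format the CUDA/cuDNN entries of a TensorFlow build-info mapping."""
    # .get() is used because CPU-only builds may not carry these keys.
    cuda = build_info.get("cuda_version", "unknown")
    cudnn = build_info.get("cudnn_version", "unknown")
    return f"built for CUDA {cuda}, cuDNN {cudnn}"


def tensorflow_build_summary() -> str:
    """Summarize the build info of the TensorFlow installed in this environment."""
    import tensorflow as tf  # imported lazily so summarize_build works without TensorFlow

    return summarize_build(tf.sysconfig.get_build_info())
```

Running print(tensorflow_build_summary()) inside the venv and comparing the result with nvcc --version gives a quick view of how far the local toolkit is from what the wheel expects.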
    

    Regarding the NUMA warnings:

    I found this post by zrruziev on GitHub Gist, which presents a step-by-step fix for the warning. However, since the proposed fix has to be reapplied by hand after each reboot, I took the script given by enonu and created a service that does this automatically. The steps are:

    1. Create the bash script file
    sudo nano /usr/local/bin/set_numa.sh
    
    #!/bin/bash
    
    # Writing to sysfs requires root.
    if [[ "$EUID" -ne 0 ]]; then
      exit 1
    fi
    
    # PCI address of the NVIDIA GPU as listed by lspci, prefixed with the PCI domain.
    PCI_ID=$(lspci | grep "VGA compatible controller: NVIDIA Corporation" | cut -d' ' -f1)
    PCI_ID="0000:$PCI_ID"
    FILE="/sys/bus/pci/devices/$PCI_ID/numa_node"
    
    if [[ -f "$FILE" ]]; then
      CURRENT_VAL=$(cat "$FILE")
      # -1 is the value TensorFlow complains about; replace it with node 0.
      if [[ "$CURRENT_VAL" -eq -1 ]]; then
        echo 0 > "$FILE"
      fi
    else
      exit 1
    fi
    
    2. Make the script executable
    sudo chmod +x /usr/local/bin/set_numa.sh
    
    3. Create the systemd service file
    sudo nano /etc/systemd/system/set_numa.service
    
    [Unit]
    Description=Set NUMA Node for NVIDIA GPU
    After=multi-user.target
    
    [Service]
    ExecStart=/usr/local/bin/set_numa.sh
    Type=oneshot
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    
    4. Enable the service so that it runs at boot, then start it. I rebooted afterwards before checking the result.
    sudo systemctl enable set_numa
    sudo systemctl start set_numa
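
To confirm that the service did its job, the value can be read back from sysfs. This is a minimal sketch: read_numa_node is a hypothetical helper, and the PCI address 0000:01:00.0 used in the usage note is the one that appears in the logs in this post, so substitute your own.

```python
from pathlib import Path
from typing import Optional


def read_numa_node(pci_id: str, sysfs_root: str = "/sys/bus/pci/devices") -> Optional[int]:
    """Return the numa_node value for a PCI device, or None if it cannot be read."""
    node_file = Path(sysfs_root) / pci_id / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Missing device, unreadable file, or unexpected file content.
        return None
```

For example, print(read_numa_node("0000:01:00.0")) should show -1 on an affected machine before the service has run and 0 afterwards, assuming the script above found the same device.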
    

    Verify everything is working:

    After rebooting my machine, I verified that everything was working by running nvidia-smi, nvcc --version, sudo systemctl status set_numa and the script from my initial post. The outputs are as follows:

    $ nvidia-smi
    Sat Aug 24 17:26:20 2024       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
    | N/A   51C    P0    N/A /  N/A |    465MiB /  4040MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Mon_Oct_11_21:27:02_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.152
    Build cuda_11.4.r11.4/compiler.30521435_0
    
    $ sudo systemctl status set_numa
    ● set_numa.service - Set NUMA Node for NVIDIA GPU
         Loaded: loaded (/etc/systemd/system/set_numa.service; enabled; vendor preset: enabled)
         Active: active (exited) since Sat 2024-08-24 23:19:47 WEST; 34min ago
        Process: 1741 ExecStart=/usr/local/bin/set_numa.sh (code=exited, status=0/SUCCESS)
       Main PID: 1741 (code=exited, status=0/SUCCESS)
            CPU: 15ms
    
    ago 24 23:19:47 <my-machine-name> systemd[1]: Starting Set NUMA Node for NVIDIA GPU...
    ago 24 23:19:47 <my-machine-name> systemd[1]: Finished Set NUMA Node for NVIDIA GPU.
    
    $ python test_script.py
    2024-08-24 23:55:54.496175: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
    TensorFlow Version: 2.5.0
    2024-08-24 23:55:56.033467: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
    2024-08-24 23:55:56.054557: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
    pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1050 Ti computeCapability: 6.1
    coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 104.43GiB/s
    2024-08-24 23:55:56.054592: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
    2024-08-24 23:55:56.071430: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
    2024-08-24 23:55:56.071482: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
    2024-08-24 23:55:56.078799: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
    2024-08-24 23:55:56.080862: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
    2024-08-24 23:55:56.082734: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
    2024-08-24 23:55:56.086919: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
    2024-08-24 23:55:56.087933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
    2024-08-24 23:55:56.089044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
    Physical devices:
    PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
    PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
    GPU is available
    2024-08-24 23:55:56.090068: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2024-08-24 23:55:56.091047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
    pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1050 Ti computeCapability: 6.1
    coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 104.43GiB/s
    2024-08-24 23:55:56.091593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
    2024-08-24 23:55:56.092275: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
    2024-08-24 23:55:56.895543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
    2024-08-24 23:55:56.895603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
    2024-08-24 23:55:56.895612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
    2024-08-24 23:55:56.896557: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2975 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
    TensorFlow can run on GPU
    Result of GPU operation: [3. 4. 5. 6.]
    

    Hope this helps anyone in my spot.

    Best regards and good programming!