Tags: linux, tensorflow, nvidia, tensorrt, numa

[ LINUX ]Tensorflow-GPU not working - TF-TRT Warning: Could not find TensorRT


I have been struggling to install all the drivers required by the tensorflow-gpu library. I want to run my model on the GPU instead of the CPU. I am using Linux Mint. This is my neofetch output:

             ...-:::::-...                 
          .-MMMMMMMMMMMMMMM-.              ----------- 
      .-MMMM`..-:::::::-..`MMMM-.          OS: Linux Mint 21.3 x86_64 
    .:MMMM.:MMMMMMMMMMMMMMM:.MMMM:.        Kernel: 5.15.0-101-generic 
   -MMM-M---MMMMMMMMMMMMMMMMMMM.MMM-       Uptime: 2 hours, 33 mins 
 `:MMM:MM`  :MMMM:....::-...-MMMM:MMM:`    Packages: 3307 (dpkg), 13 (flatpak) 
 :MMM:MMM`  :MM:`  ``    ``  `:MMM:MMM:    Shell: bash 5.1.16 
.MMM.MMMM`  :MM.  -MM.  .MM-  `MMMM.MMM.   Resolution: 1920x1080 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   DE: Cinnamon 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM:MMM:   WM: Mutter (Muffin) 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   WM Theme: WhiteSur-Dark (Sweet-Dark-v40) 
.MMM.MMMM`  :MM:--:MM:--:MM:  `MMMM.MMM.   Theme: Sweet-Dark-v40 [GTK2/3] 
 :MMM:MMM-  `-MMMMMMMMMMMM-`  -MMM-MMM:    Icons: candy-icons [GTK2/3] 
  :MMM:MMM:`                `:MMM:MMM:     Terminal: gnome-terminal 
   .MMM.MMMM:--------------:MMMM.MMM.      CPU: Intel i5-3570 (4) @ 3.800GHz 
     '-MMMM.-MMMMMMMMMMMMMMM-.MMMM-'       GPU: NVIDIA GeForce GTX 1060 6GB 
       '.-MMMM``--:::::--``MMMM-.'         Memory: 2070MiB / 7883MiB 
            '-MMMMMMMMMMMMM-'
               ``-:::::-`` 


And this is my nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0  On |                  N/A |
| 25%   40C    P8              7W /  120W |     316MiB /   6144MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1142      G   /usr/lib/xorg/Xorg                            148MiB |
|    0   N/A  N/A      1881      G   cinnamon                                       45MiB |
|    0   N/A  N/A      9746      G   /app/extra/viber/Viber                         27MiB |
|    0   N/A  N/A     14935      G   ...seed-version=20240322-165906.502000         90MiB |
+-----------------------------------------------------------------------------------------+

I also installed TensorRT, cuDNN, and tensorflow-gpu, mostly via pip. Here is my TensorRT version check:
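When several of these packages come from pip, a version mismatch is easy to miss. A quick way to list everything in the GPU stack side by side is a standard-library sketch like the following (the keyword list is just a guess at the relevant package names; adjust as needed):

```python
from importlib import metadata

def gpu_stack_packages(keywords=("tensorflow", "tensorrt", "nvidia", "cudnn")):
    """Return sorted (name, version) pairs for installed distributions
    whose names mention the TensorFlow/NVIDIA GPU stack."""
    found = []
    for dist in metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        if any(k in name for k in keywords):
            found.append((name, dist.version))
    return sorted(found)

for name, version in gpu_stack_packages():
    print(name, version)
```

Comparing the reported versions against TensorFlow's tested build configurations can reveal mismatches before touching any paths.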

import tensorrt
print(tensorrt.__version__)   # prints: 8.6.1
assert tensorrt.Builder(tensorrt.Logger())

The error I am receiving is the following:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-03-25 12:49:24.151959: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-25 12:49:24.939265: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-25 12:49:24.973806: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

I am unsure whether the conflict is due to a version mismatch or to the path. When I run echo $LD_LIBRARY_PATH I get :/home/vuk/miniconda3/lib/python3.1/site-packages/tensorrt_libs
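Since TF-TRT locates TensorRT by dlopen-ing its shared libraries, one way to test whether the dynamic loader can actually see them under the current LD_LIBRARY_PATH is a minimal sketch like this (the `.so.8` sonames are an assumption based on TensorRT 8.6.1; adjust the major version to match your install):

```python
import ctypes

def can_dlopen(soname: str) -> bool:
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

# TF-TRT tries to load these at import time; if they fail here, the
# "Could not find TensorRT" warning is the expected symptom.
for lib in ("libnvinfer.so.8", "libnvinfer_plugin.so.8"):
    print(lib, "found" if can_dlopen(lib) else "NOT found")
```

If these print "NOT found" while `import tensorrt` succeeds, the libraries exist in site-packages but their directory is not on the loader's search path, which points at LD_LIBRARY_PATH rather than the installed versions.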

This has been bugging me for months now...

I tried installing and uninstalling the libraries multiple times, configuring the path, and installing tensorflow-gpu through Docker, but nothing has worked so far. The issue might be a mismatch between the libraries I am using, but I am unsure.


Solution

  • I managed to solve this by creating two bash scripts for my conda environment. Inside your conda environment directory, navigate to etc/conda/activate.d and etc/conda/deactivate.d (create these directories if they do not exist). Then create a script file (e.g., set_env_vars.sh) in each directory.

    The first script goes in activate.d:

    #!/bin/sh
    # Locate the pip-installed NVIDIA packages (cuDNN lives under .../site-packages/nvidia/cudnn)
    export NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")))
    # Collect every .../nvidia/*/lib directory, colon-separated, and prepend it to LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=$(echo ${NVIDIA_DIR}/*/lib/ | sed -r 's/\s+/:/g')${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

    The second script goes in deactivate.d:

    #!/bin/sh
    unset NVIDIA_DIR
    # Note: this clears LD_LIBRARY_PATH entirely, including anything that was set before activation
    unset LD_LIBRARY_PATH

    Then I added execute permissions to both scripts:

    chmod +x /path/to/your/conda/env/etc/conda/activate.d/set_env_vars.sh
    chmod +x /path/to/your/conda/env/etc/conda/deactivate.d/set_env_vars.sh
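    For reference, the activate.d one-liner is doing roughly the following (a Python sketch of the same glob-and-join; the `nvidia_dir` argument is a stand-in for $NVIDIA_DIR):

    ```python
    import glob
    import os

    def build_ld_library_path(nvidia_dir: str, existing: str = "") -> str:
        """Mimic the activate.d line: collect every <nvidia_dir>/*/lib
        directory and join them with ':', appending any pre-existing
        LD_LIBRARY_PATH at the end."""
        lib_dirs = sorted(glob.glob(os.path.join(nvidia_dir, "*", "lib")))
        path = ":".join(lib_dirs)
        if existing:
            path = f"{path}:{existing}" if path else existing
        return path
    ```

    After creating both scripts, deactivating and re-activating the environment should leave LD_LIBRARY_PATH pointing at each nvidia/*/lib directory, and the `tf.config.list_physical_devices('GPU')` check from the question should then report the GPU.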