
Tensorflow complains that no CUDA-capable device is detected


I'm trying to run some Tensorflow code, and I get what seems to be a common problem:

$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris
2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris
2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0
2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0
2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
$

The key pieces of that error message seem to be:

[...] libcuda reported version is: 390.77.0
[...] kernel reported version is: 390.30.0
[...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration

How can I install compatible versions? Where is that libcuda version coming from?

Background

A few months ago, I tried installing Tensorflow with GPU support, but the versions either broke my display or wouldn't work with Tensorflow. Finally, I got it working by following a tutorial on installing multiple versions of the CUDA libraries on the same machine. That worked at the time, but when I came back to the project a few months later, it had stopped working. I assume some driver got upgraded in the meantime.

Investigation

The first thing I tried was to see what versions I have of the nvidia drivers and libcuda package.

$ dpkg --list|grep libcuda
ii  libcuda1-390                                                390.30-0ubuntu1                              amd64        NVIDIA CUDA runtime library

Looks like it's 390.30. Why does the error message say that libcuda reported 390.77?
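One thing to keep in mind: dpkg reports the version of the installed package, but Tensorflow loads whatever file the libcuda.so.1 symlink chain ultimately points to, and the two can disagree. A minimal demo of how readlink -f exposes such a chain (on temporary files, not the real driver libraries):

```shell
# Demo: a symlink chain decides which library version actually loads.
tmp=$(mktemp -d)
touch "$tmp/libcuda.so.390.30" "$tmp/libcuda.so.390.77"
ln -s libcuda.so.390.77 "$tmp/libcuda.so.1"  # the loader opens libcuda.so.1
ln -s libcuda.so.1 "$tmp/libcuda.so"
# Follow the whole chain to the real file:
resolved=$(readlink -f "$tmp/libcuda.so")
echo "$resolved"  # ends in libcuda.so.390.77, not .390.30
```

On the real machine, running readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1 would show which file the loader actually picks up.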

$ dpkg --list|grep nvidia
ii  libnvidia-container-tools                                   1.0.1-1                                      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                                  1.0.1-1                                      amd64        NVIDIA container runtime library
rc  nvidia-384                                                  384.130-0ubuntu0.16.04.1                     amd64        NVIDIA binary driver - version 384.130
ii  nvidia-390                                                  390.30-0ubuntu1                              amd64        NVIDIA binary driver - version 390.30
ii  nvidia-390-dev                                              390.30-0ubuntu1                              amd64        NVIDIA binary Xorg driver development files
rc  nvidia-396                                                  396.44-0ubuntu1                              amd64        NVIDIA binary driver - version 396.44
ii  nvidia-container-runtime                                    2.0.0+docker18.09.1-1                        amd64        NVIDIA container runtime
ii  nvidia-container-runtime-hook                               1.4.0-1                                      amd64        NVIDIA container runtime hook
ii  nvidia-docker2                                              2.0.3+docker18.09.1-1                        all          nvidia-docker CLI wrapper
ii  nvidia-modprobe                                             390.30-0ubuntu1                              amd64        Load the NVIDIA kernel driver and create device files
rc  nvidia-opencl-icd-384                                       384.130-0ubuntu0.16.04.1                     amd64        NVIDIA OpenCL ICD
ii  nvidia-opencl-icd-390                                       390.30-0ubuntu1                              amd64        NVIDIA OpenCL ICD
rc  nvidia-opencl-icd-396                                       396.44-0ubuntu1                              amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                                0.8.8.2                                      all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                                             396.44-0ubuntu1                              amd64        Tool for configuring the NVIDIA graphics driver

Again, everything looks like it's 390.30. There were also some packages in the rc status, meaning they had been removed but their configuration files were left behind -- presumably remnants of versions I installed and later removed. I purged those leftover configuration files with commands like this:

sudo apt-get remove --purge nvidia-kernel-common-390

Now, there are no packages at all with version 390.77.
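For reference, the first column of the dpkg --list output encodes the package state: "ii" means installed, while "rc" means removed with configuration files remaining. A quick way to pull out every such leftover package, demonstrated here on sample text so it runs anywhere (on the real machine you would pipe dpkg -l into the same awk):

```shell
# "rc" in column one = removed, but config files remain.
# Sample lines stand in for real `dpkg -l` output.
sample='ii  nvidia-390  390.30-0ubuntu1           amd64  NVIDIA binary driver
rc  nvidia-384  384.130-0ubuntu0.16.04.1  amd64  NVIDIA binary driver'
leftovers=$(printf '%s\n' "$sample" | awk '/^rc/{print $2}')
echo "$leftovers"  # prints only the rc package names
```

The resulting names can be fed straight to sudo apt-get remove --purge.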

$ dpkg --list|grep 390.77
$

I tried reinstalling CUDA, to see if it had been compiled with the wrong version.

$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override

That didn't make any difference.

Finally, I tried running nvidia-smi.

$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$
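That confirms the same root cause: the loaded kernel module and the user-space libraries report different driver versions, and they must match. Conceptually, the failing check boils down to a comparison like this sketch, with hard-coded versions standing in for the real ones (on an actual system the kernel side comes from /proc/driver/nvidia/version):

```shell
# Hypothetical values standing in for the real ones:
#   kernel_ver would come from /proc/driver/nvidia/version
#   lib_ver    would come from the libcuda.so.* filename
kernel_ver="390.30"
lib_ver="390.77"
if [ "$kernel_ver" = "$lib_ver" ]; then
  status="match"
else
  status="mismatch: kernel $kernel_ver vs library $lib_ver"
fi
echo "$status"
```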

All of this is running on Ubuntu 18.04 with Python 3.6.7, and my graphics card is NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2).


Solution

  • I finally had the idea to look for any files with 390.77 in the name.

    $ locate 390.77
    /usr/lib/i386-linux-gnu/libcuda.so.390.77
    /usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
    /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
    /usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
    /usr/lib/x86_64-linux-gnu/libcuda.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
    /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
    /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
    

    So there they are! A closer look shows that I must have installed the newer version at some point.

    $ ls /usr/lib/i386-linux-gnu/libcuda* -l
    lrwxrwxrwx 1 root root      12 Nov  8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
    lrwxrwxrwx 1 root root      17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
    -rw-r--r-- 1 root root 9179124 Jan 31  2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
    -rw-r--r-- 1 root root 9179796 Jul 10  2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
    

    Where did they come from?

    $ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
    libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
    $ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
    dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
    

    So the 390.77 files no longer belong to any package. Perhaps when I installed the old version, I had to force it to overwrite the links.

    My plan is to delete the files, then reinstall the packages to set up the links to the correct version. So which packages will I need to reinstall?

    $ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
    

    Some of the files don't match anything, but the ones that do match are from these packages:

    • libcuda1-390
    • nvidia-opencl-icd-390

    Crossing my fingers, I delete the version 390.77 files.

    locate 390.77|sudo xargs rm
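
    Piping locate straight into rm is risky if the pattern matches something unexpected, so a dry run that only echoes the commands is a safer first step. A self-contained sketch on throwaway files (the real pipeline would start from locate 390.77 instead of find):

    ```shell
    # Demo on temporary files, not the real driver libraries.
    tmp=$(mktemp -d)
    touch "$tmp/libcuda.so.390.77" "$tmp/libcuda.so.390.30"
    # Dry run: print the rm commands without executing them.
    find "$tmp" -name '*390.77*' | xargs -r echo rm
    # Output looks right, so delete for real.
    find "$tmp" -name '*390.77*' -delete
    ls "$tmp"  # only libcuda.so.390.30 remains
    ```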
    

    Then I reinstall the packages.

    sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
    

    Finally, it works!

    $ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
    2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
    name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
    pciBusID: 0000:01:00.0
    totalMemory: 3.95GiB freeMemory: 3.81GiB
    2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
    2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
    2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
    2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
    

    nvidia-smi also works now.

    $ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
    Wed Feb  6 22:19:24 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
    | N/A   45C    P8    N/A /  N/A |    113MiB /  4046MiB |      6%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      3212      G   /usr/lib/xorg/Xorg                           113MiB |
    +-----------------------------------------------------------------------------+
    

    I rebooted, and the video drivers continued to work. Hurrah!

    Update 2023

    I tried going through this installation again, and I think I got a version of CUDA that's too new for Tensorflow. To see which version of CUDA Tensorflow was compiled with:

    python -c "import tensorflow.sysconfig; print(tensorflow.sysconfig.get_build_info()['cuda_version'])"
    

    The API has evolved since then; in TensorFlow 2, the old Session class lives under tensorflow.compat.v1, so the equivalent command is:

    LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.compat.v1.Session()"
    

    I found the TensorFlow installation instructions, which gave me the final steps: pip installing nvidia-cudnn-cu11 and adding its lib folder to LD_LIBRARY_PATH. I also found a better test: listing the GPU devices.

    $ echo $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
    /path/to/venv/lib/python3.10/site-packages/nvidia/cudnn
    $ LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/path/to/venv/lib/python3.10/site-packages/nvidia/cudnn/lib python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    ...
    [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
    

    Maybe using conda would make this easier, but I didn't try.