
Python & Tensorflow & CUDA Environment Setup Problems


I had tensorflow 2.2 working with Python 3.7.4 on Windows 10 Enterprise 64-bit yesterday, including GPU support. This morning, the same system no longer sees the GPU. I have uninstalled and reinstalled CUDA and the other requirements listed in the tensorflow docs, but tensorflow still does not detect the GPU.

PC specs: i7 CPU 3.70GHz, 64GB RAM, NVidia GeForce GTX 780 Ti video card installed (driver 26.21.14.4122).

https://www.tensorflow.org/install/gpu says tensorflow requires NVidia CUDA Toolkit 10.1 specifically (not 10.0, not 10.2).

Naturally, that version refuses to install on my PC. These components fail during install:

  • Visual Studio Integration
  • NSight Systems
  • NSight Compute

So, I installed 10.2 which installs properly, but things don't run (which is not a surprise, given the tensorflow docs).

What's installed:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.22       Driver Version: 441.22       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780 Ti WDDM  | 00000000:01:00.0 N/A |                  N/A |
| 27%   41C    P8    N/A /  N/A |    458MiB /  3072MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

I know the nvcc output of 10.2.89 is not what I need, but 10.1 simply won't install, so I don't know what I can do. Is this a common problem? Is there a diagnostic I can run to make sure the card did not die? Should I downgrade my version of tensorflow? Should I abandon this environment altogether? If so, what is a stable environment to learn ML?
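
The mismatch described above can be checked mechanically. Below is a minimal sketch, assuming the CUDA 10.1 requirement quoted from the tensorflow install page; the helper names `parse_nvcc_release` and `cuda_matches` are made up for illustration. It only parses the text that `nvcc --version` prints and says nothing about the health of the card itself.

```python
import re
from typing import Optional, Tuple

# Assumption: tensorflow 2.2 expects CUDA 10.1, per the install page cited above.
REQUIRED_CUDA = (10, 1)


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from `nvcc --version` text, or None if not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_matches(nvcc_output: str,
                 required: Tuple[int, int] = REQUIRED_CUDA) -> bool:
    """True only when the toolkit release equals the required release."""
    return parse_nvcc_release(nvcc_output) == required


# Last line of the nvcc output shown in the question.
sample = "Cuda compilation tools, release 10.2, V10.2.89"
print(parse_nvcc_release(sample))  # (10, 2)
print(cuda_matches(sample))        # False
```

To use it on a live system, pass in the captured output of `nvcc --version` instead of the sample string.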


Solution

  • Below is how I got it working. Tensorflow 2.2.0, Windows 10, Python 3.7 (64-bit). Thanks again to Yahya for the gentle nudge towards this solution.

    Uninstall every bit of NVIDIA software.

    Install CUDA Toolkit 10.1. I did the Express Install of package cuda_10.1.243_win10_network.exe. Any other version of CUDA 10.1 did not install correctly.

    Install cuDNN 7.6. Extract all files from cudnn-10.1-windows10-x64-v7.6.5.32 into the CUDA file structure (i.e. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1).
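
A quick way to confirm the cuDNN extraction landed in the right place is to look for the files it should have added. This is only a sketch: the three relative paths are what I would expect from the cuDNN 7.6 Windows archive (treat them as an assumption and compare against your own zip), and `missing_cudnn_files` is a hypothetical helper, not part of any NVIDIA tool.

```python
from pathlib import Path
from typing import List

# Default install location from the step above.
CUDA_ROOT = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1"

# Assumed layout of the cuDNN 7.6 archive once merged into the CUDA folder.
CUDNN_FILES = [
    Path("bin") / "cudnn64_7.dll",
    Path("include") / "cudnn.h",
    Path("lib") / "x64" / "cudnn.lib",
]


def missing_cudnn_files(cuda_root: str) -> List[str]:
    """Return the expected cuDNN files that are absent under cuda_root."""
    root = Path(cuda_root)
    return [str(rel) for rel in CUDNN_FILES if not (root / rel).is_file()]


# An empty list means every expected file was found.
print("Missing cuDNN files:", missing_cudnn_files(CUDA_ROOT))
```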

    Add these directories to your path variables (assuming that you did not alter the path during installation):

    • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin
    • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\libnvvp

    Reboot to initialize the Path variables.
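
After the reboot, it can help to confirm that both directories really are on the path. The sketch below assumes the default install location from the steps above; `missing_path_entries` is a made-up helper name. It compares entries case-insensitively and ignores trailing backslashes, since Windows treats those as equivalent.

```python
import os
from typing import Iterable, List

CUDA_ROOT = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1"
REQUIRED_DIRS = [CUDA_ROOT + r"\bin", CUDA_ROOT + r"\libnvvp"]


def _normalize(entry: str) -> str:
    # Drop surrounding quotes and trailing slashes, then lowercase.
    return entry.strip().strip('"').rstrip("\\/").lower()


def missing_path_entries(required: Iterable[str], path_value: str,
                         sep: str = ";") -> List[str]:
    """Return the required directories that do not appear in path_value."""
    present = {_normalize(p) for p in path_value.split(sep) if p.strip()}
    return [d for d in required if _normalize(d) not in present]


# The default separator ";" is the Windows one; an empty list means all is well.
print("Missing from PATH:",
      missing_path_entries(REQUIRED_DIRS, os.environ.get("PATH", "")))
```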

    Uninstall all tensorflow variants via pip.

    Install tensorflow 2.2 via pip.

    Then run the Python code below (for example, by starting the python interpreter from bash) to confirm that tensorflow is able to access your video card:

    # Check if tensorflow detects the GPU
    import tensorflow as tf

    # Query tensorflow for the devices it recognizes. The results print to the terminal window
    physical_devices = tf.config.list_physical_devices()
    GPU_devices = tf.config.list_physical_devices('GPU')

    print("physical_devices:", physical_devices)
    print("Num GPUs:", len(GPU_devices))