I'm running on a server with an A100 GPU. After a server reset, TensorFlow no longer recognizes the GPU. Running tf.config.list_physical_devices('GPU') yields CUDA_ERROR_NOT_INITIALIZED:
2021-09-09 07:41:42.956917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-09 07:41:43.899014: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2021-09-09 07:41:43.899148: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: f42a3aa12bd1
2021-09-09 07:41:43.899169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: f42a3aa12bd1
2021-09-09 07:41:43.899890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.32.3
2021-09-09 07:41:43.899955: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2021-09-09 07:41:43.899969: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.32.3
Running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:00:06.0 Off |                   On |
| N/A   46C    P0    40W / 250W |      0MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Why do I get CUDA_ERROR_NOT_INITIALIZED? The server ran perfectly well before the reset, and nvidia-smi is clearly working.
It seems NVIDIA Multi-Instance GPU (MIG) is enabled on your GPU, but you haven't defined any GPU instances. You can see this in the nvidia-smi output: the MIG M. field reads Enabled, and the MIG devices table is present but empty (No MIG devices found).
The MIG documentation states:
Without creating GPU instances (and corresponding compute instances), CUDA workloads cannot be run on the GPU. In other words, simply enabling MIG mode on the GPU is not sufficient. Also note that, the created MIG devices are not persistent across system reboots. Thus, the user or system administrator needs to recreate the desired MIG configurations if the GPU or system is reset.
You probably had a MIG configuration defined before the reset, but the server reset removed that configuration. You need to re-configure the GPU instances to get the GPU working again. If you just want a basic configuration, in which you have only one GPU instance that uses all the resources, you can run:
sudo nvidia-smi mig -cgi 0 -C
If you need a fancier configuration than that, you should consult the documentation.
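Since MIG devices don't survive a reboot, it may be worth scripting the check-and-recreate step so it can run after every reset. Below is a minimal Python sketch, not an official tool. The function names (needs_mig_setup, recreate_command, ensure_mig_instance) are hypothetical, the detection simply looks for the No MIG devices found text that nvidia-smi printed in the question, and the command is the same single-instance one shown above. It also assumes sudo can run without a password prompt.

```python
import subprocess
from typing import List


def needs_mig_setup(smi_output: str) -> bool:
    """True when nvidia-smi shows a MIG devices table that has no entries."""
    # Assumption: the empty table is reported with exactly this text,
    # as in the nvidia-smi output from the question.
    return "MIG devices:" in smi_output and "No MIG devices found" in smi_output


def recreate_command(profile_id: int = 0) -> List[str]:
    """Build the command from the answer: one GPU instance plus its compute instance."""
    return ["sudo", "nvidia-smi", "mig", "-cgi", str(profile_id), "-C"]


def ensure_mig_instance(profile_id: int = 0) -> bool:
    """Recreate the MIG configuration if it is missing.

    Returns True if the create command was run, False if nothing was needed.
    """
    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True)
    if not needs_mig_setup(smi.stdout):
        return False
    subprocess.run(recreate_command(profile_id), check=True)
    return True
```

Calling ensure_mig_instance() from a startup script would be one way to restore the basic configuration automatically; for anything other than the single full-size instance, the profile to pass should come from the documentation.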
After configuring the GPU instances, the nvidia-smi command should show a populated MIG devices table. In our case, it should have one entry:
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |      0MiB / 40536MiB | 98      0 |  7   0    5    1    1 |
|                  |      1MiB / 65536MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
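To confirm the result without reading the table by eye, something like the following could count the MIG devices. It is a sketch under one assumption: that nvidia-smi -L prints each MIG device on its own indented line beginning with MIG, which is how the MIG documentation shows it. The function names are made up for illustration.

```python
import subprocess


def count_mig_devices(listing: str) -> int:
    """Count lines in `nvidia-smi -L` output that describe a MIG device."""
    # Assumption: MIG device lines start with "MIG " once indentation is stripped;
    # lines for the parent GPU start with "GPU " and are not counted.
    return sum(1 for line in listing.splitlines() if line.strip().startswith("MIG "))


def local_mig_device_count() -> int:
    """Run `nvidia-smi -L` on this machine and count the MIG devices it lists."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return count_mig_devices(result.stdout)
```

If local_mig_device_count() returns at least 1, rerunning tf.config.list_physical_devices('GPU') should no longer fail with CUDA_ERROR_NOT_INITIALIZED.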