docker cuda nvidia tritonserver triton

"Cannot get CUDA device count, GPU metrics will not be available": NVIDIA Triton server issue in Docker


I am trying to run the NVIDIA Triton Inference Server through Docker. I pulled the correct Triton server image from Docker,

but when I run docker logs sample-tis-22.04 --tail 40

it shows this:

I0610 15:59:37.597914 1 server.cc:576]
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0610 15:59:37.597933 1 server.cc:619]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

W0610 15:59:37.635981 1 metrics.cc:634] Cannot get CUDA device count, GPU metrics will not be available
I0610 15:59:37.636226 1 tritonserver.cc:2123]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton
                                                                                                          |
| server_version                   | 2.21.0
                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0610 15:59:37.638384 1 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0610 15:59:37.638908 1 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0610 15:59:37.680861 1 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002

(nvdiaTritonServer_env) E:\Github\triton_server_ImageModel>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:30:10_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

(nvdiaTritonServer_env) E:\Github\triton_server_ImageModel>nvidia-smi
Mon Jun 10 21:17:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060      WDDM  |   00000000:05:00.0  On |                  N/A |
|  0%   49C    P8              9W /  170W |     736MiB /  12288MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1280    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A      2568    C+G   ...siveControlPanel\SystemSettings.exe      N/A      |
|    0   N/A  N/A      2780    C+G   ...\Docker\frontend\Docker Desktop.exe      N/A      |
|    0   N/A  N/A      5840    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      6212    C+G   ...al\Discord\app-1.0.9047\Discord.exe      N/A      |
|    0   N/A  N/A      7148    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A      7824    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A      8068    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A     10332    C+G   ...on\125.0.2535.92\msedgewebview2.exe      N/A      |
|    0   N/A  N/A     10972    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A      |
|    0   N/A  N/A     13484    C+G   ...GeForce Experience\NVIDIA Share.exe      N/A      |
|    0   N/A  N/A     13712    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A     18732    C+G   ....0_x64__8wekyb3d8bbwe\HxOutlook.exe      N/A      |
|    0   N/A  N/A     19024    C+G   ...7.0_x64__cv1g1gvanyjgm\WhatsApp.exe      N/A      |
+-----------------------------------------------------------------------------------------+
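
The banner line of the nvidia-smi output above carries the installed driver version and the highest CUDA version that driver supports (which is separate from the toolkit release that nvcc reports). As a minimal sketch, the two values can be pulled out of the banner like this; the helper names `parse_smi_banner` and `as_tuple` are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Matches the banner line printed at the top of `nvidia-smi` output.
_BANNER = re.compile(
    r"NVIDIA-SMI\s+[\d.]+\s+"
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_smi_banner(text):
    """Return (driver_version, cuda_version) as strings, or None if no banner is found."""
    match = _BANNER.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


def as_tuple(version):
    """Turn '555.85' into (555, 85) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


# Banner line copied from the nvidia-smi output in the question.
banner = "| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |"
driver, cuda = parse_smi_banner(banner)
print("driver:", driver, "max CUDA:", cuda)
```

Comparing `as_tuple(driver)` against a known-good version is then a plain tuple comparison.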

I am running this in an Anaconda environment. I have installed CUDA and cuDNN and checked that nvcc --version works and prints the output above.

But the log says GPU metrics will not be available, and the Model / Version / Status table is empty even though the model repository path is correct.
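
The warning is easy to miss among the info lines. A rough sketch for filtering a saved copy of the log down to warnings and errors follows; it assumes the glog-style prefix visible in the log above (severity letter, MMDD date, time, thread id, source file and line), and the function name `find_problems` is hypothetical.

```python
import re

# Prefix format seen in the Triton log:
# <severity><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>
_GLOG = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\]\s?(?P<msg>.*)$"
)


def find_problems(log_text):
    """Return (severity, source, message) for every warning, error, or fatal line."""
    problems = []
    for line in log_text.splitlines():
        match = _GLOG.match(line.strip())
        if match and match.group("sev") in "WEF":
            problems.append((match.group("sev"), match.group("src"), match.group("msg")))
    return problems


# Two lines copied from the log in the question.
sample_log = """\
W0610 15:59:37.635981 1 metrics.cc:634] Cannot get CUDA device count, GPU metrics will not be available
I0610 15:59:37.638384 1 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
"""

for severity, source, message in find_problems(sample_log):
    print(severity, source, message)
```

The text to scan could come from redirecting the docker logs output to a file first.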


Solution

  • I solved this issue. My GPU is an RTX 3060, and nvidia-smi reports the current driver version:

    NVIDIA-SMI 555.85 Driver Version: 555.85 CUDA Version: 12.5

    My Docker version,

    Current version: 4.30.0 (149282)

    does not support driver version 555.85 with CUDA 12.5.

    So downgrade the NVIDIA driver to 552.22 (CUDA 12.4)

    by downloading the driver from www.nvidia.com/download/driverResults.aspx/224154/en-us/

    Choose the clean install option.

    Reboot the system,

    then run docker compose again. Docker will now detect the GPU device, and GPU metrics will be available.
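
One way to confirm the fix, beyond the warning disappearing from the log, is to look at the metrics service that the log shows on port 8002. The sketch below assumes the port is published to localhost and that GPU metrics use the `nv_gpu_` prefix (for example `nv_gpu_utilization`); the function names and the sample text are illustrative, not captured output.

```python
from urllib.request import urlopen


def gpu_metric_names(metrics_text):
    """Collect metric names starting with 'nv_gpu_' from Prometheus-format text."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at '{' (labels) or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("nv_gpu_"):
            names.add(name)
    return names


def fetch_metrics(url="http://localhost:8002/metrics", timeout=5):
    """Download the metrics page. The URL is an assumption about the port mapping."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Made-up sample in Prometheus text format, used here instead of a live server.
sample = """\
# HELP nv_gpu_utilization GPU utilization rate
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-example"} 0.0
nv_inference_count{model="example",version="1"} 0
"""

print(sorted(gpu_metric_names(sample)))
```

Against a running container, `gpu_metric_names(fetch_metrics())` coming back empty would suggest the GPU is still not visible to Triton.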