gpu google-compute-engine google-container-optimized-os

COS install GPU failed to download driver signature


I have been running a Compute Engine VM with a T4 GPU on COS for quite some time, and it worked fine until recently, when cos-extensions install gpu stopped working as before:

I0830 07:32:58.419130     987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417     987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566     987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911     987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403     987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594     987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646     987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674     987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681     987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327     987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692     987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721     987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902     987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401     987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz

It seems the installer could not find the driver signature. I looked into this and tried the suggested workaround:

/usr/bin/docker run --rm \
    --privileged \
    --net=host \
    --pid=host \
    --volume /dev:/dev \
    --volume /:/root \
    --volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
    --env NVIDIA_DRIVER_VERSION=450.119.04 \
    gcr.io/cos-cloud/cos-gpu-installer:latest

but got this instead:

+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO    2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO    2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO    2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO    2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO    2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO    2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable

It seems something is changing with COS and the COS GPU driver (maybe?). Is there a workaround for this problem, apart from waiting for GCP to sort things out?


Solution

  • This is the same case as the one Jan Vansteenlandt linked to.

    This happens in some versions of COS.

    For example, on the latest stable COS version currently available, 89-16108:

    vm-16108 ~ # cos-extensions list
    Available extensions for COS version 89-16108.470.16:
    
    [gpu]
    

    No driver is listed under [gpu], and running cos-extensions install gpu fails the same way as in your case. Running the docker container you mentioned yields the same result as well.

    This is a known issue and has already been raised on the Issue Tracker. You can follow the link and click the +1 button; you can also comment and post your own findings in the thread.

    There's also a workaround in the thread, so you may give it a go.

    If you can use an older version of COS (85-13310, for example), the driver is listed:

    vm-13310 ~ # cos-extensions list
    Available extensions for COS version 85-13310.1308.10:
    
    [gpu]
    450.119.04 [default]
    

    And running cos-extensions install gpu there results in a successful installation of the NVIDIA drivers:

    
    vm-13310 ~ # cos-extensions install gpu
    I0831 14:25:11.405591    1168 main.go:21] Checking if this is the only cos_gpu_installer that is running.
    I0831 14:25:11.407510    1168 install.go:74] Running on COS build id 13310.1308.10
    I0831 14:25:11.407519    1168 installer.go:187] Getting the default GPU driver version
    I0831 14:25:11.407581    1168 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
    I0831 14:25:11.448046    1168 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
    I0831 14:25:11.448539    1168 install.go:85] Installing GPU driver version 450.119.04
    I0831 14:25:11.448751    1168 cache.go:69] error: failed to read file /root/var/lib/nvidia/.cache: open /root/var/lib/nvidia/.cache: no such file or directory
    I0831 14:25:11.448942    1168 install.go:120] Did not find cached version, installing the drivers...
    I0831 14:25:11.449084    1168 installer.go:82] Configuring driver installation directories
    I0831 14:25:11.469718    1168 installer.go:196] Updating container's ld cache
    I0831 14:25:11.480682    1168 signature.go:30] Downloading driver signature for version 450.119.04
    I0831 14:25:11.481007    1168 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
    I0831 14:25:11.506186    1168 utils.go:120] Successfully downloaded 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
    I0831 14:25:11.506541    1168 signature.go:37] Decompressing signature /build/sign-gpu-driver/450.119.04.signature.tar.gz
    I0831 14:25:11.510104    1168 installer.go:68] Downloading GPU driver installer version 450.119.04
    I0831 14:25:11.511637    1168 utils.go:72] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
    I0831 14:25:12.885856    1168 utils.go:120] Successfully downloaded GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
    
    -----  removed some lines for better readability  -----
    
    I0831 14:28:49.433597    1168 cache.go:58] Updated cached version as
    I0831 14:28:49.498379    1168 cache.go:60] BUILD_ID=13310.1308.10
    I0831 14:28:49.498560    1168 cache.go:60] DRIVER_VERSION=450.119.04
    I0831 14:28:49.498694    1168 installer.go:32] Verifying GPU driver installation
    I0831 14:28:50.309502    1168 utils.go:334] Tue Aug 31 14:28:50 2021       
    I0831 14:28:50.309879    1168 utils.go:334] +-----------------------------------------------------------------------------+
    I0831 14:28:50.311093    1168 utils.go:334] | NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
    I0831 14:28:50.311300    1168 utils.go:334] |-------------------------------+----------------------+----------------------+
    I0831 14:28:50.311497    1168 utils.go:334] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    I0831 14:28:50.311640    1168 utils.go:334] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    I0831 14:28:50.311784    1168 utils.go:334] |                               |                      |               MIG M. |
    I0831 14:28:50.311949    1168 utils.go:334] |===============================+======================+======================|
    I0831 14:28:50.322257    1168 utils.go:334] |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
    I0831 14:28:50.322566    1168 utils.go:334] | N/A   76C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
    I0831 14:28:50.322708    1168 utils.go:334] |                               |                      |                  N/A |
    I0831 14:28:50.322878    1168 utils.go:334] +-------------------------------+----------------------+----------------------+
    I0831 14:28:50.323119    1168 utils.go:334]                                                                                
    I0831 14:28:50.323293    1168 utils.go:334] +-----------------------------------------------------------------------------+
    I0831 14:28:50.323431    1168 utils.go:334] | Processes:                                                                  |
    I0831 14:28:50.323597    1168 utils.go:334] |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    I0831 14:28:50.323715    1168 utils.go:334] |        ID   ID                                                   Usage      |
    I0831 14:28:50.323863    1168 utils.go:334] |=============================================================================|
    I0831 14:28:50.324222    1168 utils.go:334] |  No running processes found                                                 |
    I0831 14:28:50.324439    1168 utils.go:334] +-----------------------------------------------------------------------------+
    I0831 14:28:50.465730    1168 modules.go:48] Updating host's ld cache
    I0831 14:28:52.305122    1168 install.go:167] Finished installing the drivers.
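
The two symptoms compared above (an empty [gpu] section in cos-extensions list, and a 404 on the signature tarball) can be checked before running the installer. The following is only a rough sketch, not part of COS or cos-extensions: the helper names signature_url, listed_gpu_drivers and signature_available are made up for illustration, the URL pattern is copied from the installer logs shown in this thread and may change, and curl is assumed to be available on the machine running the check.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight helpers; none of these functions ship with COS.

# Hypothetical helper: build the signature URL for a COS build id and
# driver version, following the pattern visible in the installer logs.
signature_url() {
  local build_id="$1" driver_version="$2"
  printf 'https://storage.googleapis.com/cos-tools/%s/extensions/gpu/%s.signature.tar.gz\n' \
    "$build_id" "$driver_version"
}

# Hypothetical helper: read `cos-extensions list` output on stdin and
# print every driver version found under the [gpu] section.
listed_gpu_drivers() {
  awk '
    /^[ \t]*\[gpu\]/ { in_gpu = 1; next }   # entering the [gpu] section
    /^[ \t]*\[/      { in_gpu = 0 }         # any other section header ends it
    in_gpu && $1 ~ /^[0-9]+(\.[0-9]+)+$/ { print $1 }
  '
}

# Hypothetical helper: succeed only if the signature tarball can be fetched.
# curl --fail makes an HTTP error such as 404 produce a non-zero exit status.
signature_available() {
  curl --silent --fail --head --output /dev/null "$(signature_url "$1" "$2")"
}
```

On a COS VM this could be used as cos-extensions list | listed_gpu_drivers, which prints nothing on an affected build, or as signature_available 16108.470.16 450.119.04 || echo "signature missing", which reproduces the 404 from the question without starting the installer.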