I have been running a Compute Engine VM with a T4 GPU on COS for quite some time, and it worked fine until recently. Now cos-extensions install gpu
no longer works as it did before:
I0830 07:32:58.419130 987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417 987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566 987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911 987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403 987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594 987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646 987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674 987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681 987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327 987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692 987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721 987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902 987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401 987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz
It seems the installer could not find the driver signature. I looked into this and tried the workaround of running the installer container directly:
/usr/bin/docker run --rm \
--privileged \
--net=host \
--pid=host \
--volume /dev:/dev \
--volume /:/root \
--volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
--env NVIDIA_DRIVER_VERSION=450.119.04 \
gcr.io/cos-cloud/cos-gpu-installer:latest
but got this instead
+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO 2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO 2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO 2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO 2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable
It seems like something has changed in COS and the COS GPU driver (maybe?). Is there a workaround for this problem, apart from waiting for GCP to sort things out?
This is the same case as the one Jan Vansteenlandt linked to.
This happens in some versions of COS,
for example the latest stable COS version available now, 89-16108:
vm-16108 ~ # cos-extensions list
Available extensions for COS version 89-16108.470.16:
[gpu]
There is no driver listed under [gpu],
and running cos-extensions install gpu
ends the same way as in your case. Running the docker container you mentioned also yielded the same result.
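The 404 in the installer log suggests that the signature file is simply not published for that build. Below is a minimal sketch for checking this for any build ID and driver version. It assumes the URL layout seen in the log and that curl is available on the machine you run it from; the function names are hypothetical and not part of any COS tooling.

```shell
#!/bin/bash
# Sketch: check whether a GPU driver signature is published for a COS build.
# The URL layout is copied from the installer log; the function names
# are hypothetical.

# Print the signature URL for a given build ID and driver version.
signature_url() {
  local build_id="$1" driver_version="$2"
  printf 'https://storage.googleapis.com/cos-tools/%s/extensions/gpu/%s.signature.tar.gz\n' \
    "$build_id" "$driver_version"
}

# Print the HTTP status code of a HEAD request for that URL.
# A 404 means the installer will fail at the signature step.
signature_status() {
  curl --silent --head --output /dev/null --write-out '%{http_code}\n' \
    "$(signature_url "$1" "$2")"
}

# Example (requires network access):
#   signature_status 16108.470.16 450.119.04
#   signature_status 13310.1308.10 450.119.04
```

Comparing the status for a failing build with that of a working one (such as 13310.1308.10) shows whether the artifact is missing on the server side rather than something being wrong on the VM.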
This is a known issue and has already been raised on IssueTracker. You can follow the link and click the +1
button; you can also comment and post your own findings in the thread.
There's also a workaround in the thread, so you may give it a go.
If you can use an older version of COS (85-13310, for example), the driver is listed:
vm-13310 ~ # cos-extensions list
Available extensions for COS version 85-13310.1308.10:
[gpu]
450.119.04 [default]
And when you run cos-extensions install gpu,
it results in a successful installation of the NVIDIA drivers:
vm-13310 ~ # cos-extensions install gpu
I0831 14:25:11.405591 1168 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0831 14:25:11.407510 1168 install.go:74] Running on COS build id 13310.1308.10
I0831 14:25:11.407519 1168 installer.go:187] Getting the default GPU driver version
I0831 14:25:11.407581 1168 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448046 1168 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448539 1168 install.go:85] Installing GPU driver version 450.119.04
I0831 14:25:11.448751 1168 cache.go:69] error: failed to read file /root/var/lib/nvidia/.cache: open /root/var/lib/nvidia/.cache: no such file or directory
I0831 14:25:11.448942 1168 install.go:120] Did not find cached version, installing the drivers...
I0831 14:25:11.449084 1168 installer.go:82] Configuring driver installation directories
I0831 14:25:11.469718 1168 installer.go:196] Updating container's ld cache
I0831 14:25:11.480682 1168 signature.go:30] Downloading driver signature for version 450.119.04
I0831 14:25:11.481007 1168 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506186 1168 utils.go:120] Successfully downloaded 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506541 1168 signature.go:37] Decompressing signature /build/sign-gpu-driver/450.119.04.signature.tar.gz
I0831 14:25:11.510104 1168 installer.go:68] Downloading GPU driver installer version 450.119.04
I0831 14:25:11.511637 1168 utils.go:72] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
I0831 14:25:12.885856 1168 utils.go:120] Successfully downloaded GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
----- removed some lines for better readability -----
I0831 14:28:49.433597 1168 cache.go:58] Updated cached version as
I0831 14:28:49.498379 1168 cache.go:60] BUILD_ID=13310.1308.10
I0831 14:28:49.498560 1168 cache.go:60] DRIVER_VERSION=450.119.04
I0831 14:28:49.498694 1168 installer.go:32] Verifying GPU driver installation
I0831 14:28:50.309502 1168 utils.go:334] Tue Aug 31 14:28:50 2021
I0831 14:28:50.309879 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.311093 1168 utils.go:334] | NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
I0831 14:28:50.311300 1168 utils.go:334] |-------------------------------+----------------------+----------------------+
I0831 14:28:50.311497 1168 utils.go:334] | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
I0831 14:28:50.311640 1168 utils.go:334] | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
I0831 14:28:50.311784 1168 utils.go:334] | | | MIG M. |
I0831 14:28:50.311949 1168 utils.go:334] |===============================+======================+======================|
I0831 14:28:50.322257 1168 utils.go:334] | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
I0831 14:28:50.322566 1168 utils.go:334] | N/A 76C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
I0831 14:28:50.322708 1168 utils.go:334] | | | N/A |
I0831 14:28:50.322878 1168 utils.go:334] +-------------------------------+----------------------+----------------------+
I0831 14:28:50.323119 1168 utils.go:334]
I0831 14:28:50.323293 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.323431 1168 utils.go:334] | Processes: |
I0831 14:28:50.323597 1168 utils.go:334] | GPU GI CI PID Type Process name GPU Memory |
I0831 14:28:50.323715 1168 utils.go:334] | ID ID Usage |
I0831 14:28:50.323863 1168 utils.go:334] |=============================================================================|
I0831 14:28:50.324222 1168 utils.go:334] | No running processes found |
I0831 14:28:50.324439 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.465730 1168 modules.go:48] Updating host's ld cache
I0831 14:28:52.305122 1168 install.go:167] Finished installing the drivers.
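To avoid running the installer on an image that has no driver published, the cos-extensions list output can be checked first. This is a rough sketch that assumes the output format shown in the two listings above (a [gpu] header followed by zero or more driver lines); the function name is hypothetical.

```shell
#!/bin/bash
# Sketch: read `cos-extensions list` output on stdin and succeed (exit 0)
# only if at least one non-blank line follows the "[gpu]" header.
# Assumes the output format shown in the listings above.
gpu_driver_listed() {
  awk '
    /^[ \t]*\[gpu\]/ { in_gpu = 1; next }   # entered the [gpu] section
    /^[ \t]*\[/      { in_gpu = 0 }         # another section header ends it
    in_gpu && NF     { found = 1 }          # non-blank line: a driver entry
    END { exit (found ? 0 : 1) }
  '
}

# Example, on the VM itself:
#   cos-extensions list | gpu_driver_listed && cos-extensions install gpu
```

With the 85-13310 listing this succeeds because 450.119.04 appears under [gpu]; with the 89-16108 listing it fails, so the install step is skipped.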