Tags: docker, google-compute-engine, nvidia, google-container-optimized-os

How can I get `cos-extensions install gpu` to work on a Google Cloud VM?


I'm trying to set up Container-Optimized OS (COS) on GCE with a GPU, following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus. After creating the VM, the docs say to SSH in and run `cos-extensions install gpu`. That works: during the install it runs `nvidia-smi`, which prints the driver version (440.33.01) and connects to the card.

But it installs the NVIDIA binaries and libraries in /var/lib/nvidia, which is mounted noexec on this OS (it's very locked down). That means none of the libs or utilities work. And when you mount them into a Docker container, they don't work there either; they're still noexec.

The only workaround I've found is to copy the whole /var/lib/nvidia dir to a tmpfs scratch disk and use it from there. Am I using it wrong, or is it just broken?


Solution

  • This doesn't look like a containerd issue; it is expected Container-Optimized OS behaviour. COS adds another level of hardening by using security-minded default values for several features.

    According to the documentation on the Container-Optimized OS filesystem, everything under /var is mounted noexec except for:

    • /var/lib/google
    • /var/lib/docker
    • /var/lib/toolbox

    Those three are mounted writable, executable, and stateful.

    The Ubuntu containerd images, on the other hand, do not apply the same strict noexec mount options as COS, so using an Ubuntu-based image instead of COS could be a good workaround.

    Another option is to copy the contents of /var/lib/nvidia to another mount point that is not mounted with the noexec option, as you already did.
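
To make the noexec check and the copy workaround concrete, here is a minimal shell sketch. It is not taken from the COS documentation: `exec_state` and `copy_tree` are hypothetical helper names. The first assumes the usual /proc/mounts layout (mount point in field 2, options in field 4), and the second assumes you pick a destination that is not on a noexec mount.

```shell
#!/bin/sh
# Sketch only: exec_state and copy_tree are hypothetical helper names.

# Print "noexec" or "exec" for the mount that contains a path.
#   $1 = path to check
#   $2 = mount table to read (defaults to /proc/mounts)
exec_state() {
  awk -v p="$1" '
    {
      mp = $2   # field 2 is the mount point, field 4 the option list
      if (mp == "/" || p == mp || index(p, mp "/") == 1) {
        # Keep the longest matching mount point; on a tie the later entry wins.
        if (length(mp) >= best) { best = length(mp); opts = $4 }
      }
    }
    END {
      state = "exec"
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "noexec") state = "noexec"
      print state
    }' "${2:-/proc/mounts}"
}

# Copy a directory tree, preserving modes and symlinks, into a destination
# that is assumed to sit on a mount without noexec.
#   $1 = source directory, $2 = destination directory
copy_tree() {
  mkdir -p "$2" && cp -a "$1/." "$2/"
}
```

On a COS VM, `exec_state /var/lib/nvidia` should report noexec if the mount layout matches the description above, and running it against a candidate destination before calling `copy_tree /var/lib/nvidia <destination>` shows whether the copy will actually be executable there. Copying into a fresh location will likely need root.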