Tags: kubernetes, gpu, google-kubernetes-engine, nvidia, daemonset

GPUs not showing up on GKE node even though they show up in the GKE NodePool


I'm trying to set up a Google Kubernetes Engine cluster with GPUs in the nodes, loosely following these instructions, because I'm deploying programmatically with the Python client.
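
For context, the node pool I'm creating through the Python client is roughly equivalent to this gcloud command (the pool name, machine type, and accelerator type here are just examples, and CLUSTER-NAME/ZONE are placeholders):

# Roughly what my Python deployment does, expressed as gcloud
gcloud container node-pools create gpu-pool \
    --cluster CLUSTER-NAME \
    --zone ZONE \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --image-type COS_CONTAINERD \
    --scopes https://www.googleapis.com/auth/devstorage.read_only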

For some reason I can create a cluster with a NodePool that contains GPUs: [screenshot: GKE NodePool with GPUs]

...But the nodes in the NodePool don't have access to those GPUs: [screenshot: node without access to GPUs]

I've already installed the NVIDIA DaemonSet with this YAML file: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
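
For reference, this is how I applied it and how I'm checking the resulting pods (the label selector matches the k8s-app label visible in the pod description further down):

# Install the COS driver-installer daemonset
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Watch its pods come up in kube-system
kubectl get pods --namespace kube-system -l k8s-app=nvidia-driver-installer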

You can see that it's there in this screenshot of the kube-system pods: [screenshot: NVIDIA pods in kube-system]

For some reason those two pods always sit in status "ContainerCreating" and "PodInitializing"; they never flip green to "Running". How can I get the GPUs in the NodePool to become available on the node(s)?
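
In case it helps, this is how I'm checking whether a node actually exposes GPUs; nvidia.com/gpu is the extended resource name the device plugin is supposed to register, so I'd expect a non-empty count once everything works:

# Show the allocatable GPU count per node (<none> means no GPUs registered)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"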

Update:

Based on the comments, I ran kubectl describe pod POD_NAME --namespace kube-system on the two NVIDIA pods.

To do this I opened the kubectl terminal from the GCP Console UI. Then I ran the following commands:

gcloud container clusters get-credentials CLUSTER-NAME --zone ZONE --project PROJECT-NAME

Then, I called kubectl describe pod nvidia-gpu-device-plugin-UID --namespace kube-system and got this output:

Name:                 nvidia-gpu-device-plugin-UID
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time:           Wed, 02 Mar 2022 20:19:49 +0000
Labels:               controller-revision-hash=79765599fc
                      k8s-app=nvidia-gpu-device-plugin
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/nvidia-gpu-device-plugin
Containers:
  nvidia-gpu-device-plugin:
    Container ID:
    Image:         gcr.io/gke-release/nvidia-gpu-device-plugin@sha256:aa80c85c274a8e8f78110cae33cc92240d2f9b7efb3f53212f1cefd03de3c317
    Image ID:
    Port:          2112/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/nvidia-gpu-device-plugin
      -logtostderr
      --enable-container-gpu-metrics
      --enable-health-monitoring
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:     50m
      memory:  20Mi
    Environment:
      LD_LIBRARY_PATH:  /usr/local/nvidia/lib64
    Mounts:
      /dev from dev (rw)
      /device-plugin from device-plugin (rw)
      /etc/nvidia from nvidia-config (rw)
      /proc from proc (rw)
      /usr/local/nvidia from nvidia (rw)
      /var/lib/kubelet/pod-resources from pod-resources (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin/nvidia
    HostPathType:  Directory
  pod-resources:
    Type:          HostPath (bare host directory volume) 
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  nvidia-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/nvidia
    HostPathType:
  default-token-qnxjr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qnxjr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoExecute op=Exists
                 :NoSchedule op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    8m55s                 default-scheduler  Successfully assigned kube-system/nvidia-gpu-device-plugin-hxdwx to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
  Warning  FailedMount  6m42s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources proc]: timed out waiting for the condition
  Warning  FailedMount  4m25s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[proc nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources]: timed out waiting for the condition
  Warning  FailedMount  2m11s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[pod-resources proc nvidia-config default-token-qnxjr device-plugin dev nvidia]: timed out waiting for the condition
  Warning  FailedMount  31s (x12 over 8m45s)  kubelet            MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
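
My guess is that this mount fails because /home/kubernetes/bin/nvidia is only created once the driver installer has actually run on the node. A quick way to confirm, assuming SSH access to the node (ZONE is a placeholder):

# List the driver directory on the node itself; "No such file or directory"
# would mean the installer never completed
gcloud compute ssh gke-mycluster-clust-default-pool-26403abb-zqz6 \
    --zone ZONE --command "ls /home/kubernetes/bin/nvidia"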

Then, I called kubectl describe pod nvidia-driver-installer-UID --namespace kube-system and got this output:

Name:         nvidia-driver-installer-UID
Namespace:    kube-system
Priority:     0
Node:         gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time:   Wed, 02 Mar 2022 20:20:06 +0000
Labels:       controller-revision-hash=6bbfc44f6d
              k8s-app=nvidia-driver-installer
              name=nvidia-driver-installer
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           10.56.0.9
IPs:
  IP:           10.56.0.9
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:
    Image:          gke-nvidia-installer:fixed
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Containers:
  pause:
    Container ID:
    Image:          gcr.io/google-containers/pause:2.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  default-token-qnxjr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qnxjr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m20s                  default-scheduler  Successfully assigned kube-system/nvidia-driver-installer-tzw42 to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
  Normal   Pulling    2m36s (x4 over 4m19s)  kubelet            Pulling image "gke-nvidia-installer:fixed"
  Warning  Failed     2m34s (x4 over 4m10s)  kubelet            Failed to pull image "gke-nvidia-installer:fixed": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/gke-nvidia-installer:fixed": failed to resolve reference "docker.io/library/gke-nvidia-installer:fixed": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     2m34s (x4 over 4m10s)  kubelet            Error: ErrImagePull
  Warning  Failed     2m22s (x6 over 4m9s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m7s (x7 over 4m9s)    kubelet            Back-off pulling image "gke-nvidia-installer:fixed"
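
It looks like the pull fails because gke-nvidia-installer:fixed has no registry prefix, so containerd resolves it against docker.io/library/, where no such image exists. One way to check which image the installed daemonset actually references (the init container is the one shown in the output above):

# Print the image used by the daemonset's first init container
kubectl get daemonset nvidia-driver-installer --namespace kube-system \
    -o jsonpath='{.spec.template.spec.initContainers[0].image}'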

Solution

  • According to the Docker image that the container is trying to pull (gke-nvidia-installer:fixed), it looks like you're trying to use the Ubuntu daemonset instead of the COS one.

    You should run kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

    This will apply the right daemonset for your COS node pool, as stated here.

    In addition, please verify that your node pool has the https://www.googleapis.com/auth/devstorage.read_only scope, which is needed to pull the image. You should see it on your node pool's page in the GCP Console, under Security -> Access scopes (the relevant service is Storage). A way to check both from the command line is sketched below.
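
A sketch of how to verify the scope and then smoke-test the GPUs; the pool name, cluster, zone, and CUDA image tag are placeholders/examples, and the test pod assumes the drivers and device plugin are healthy:

# Confirm the node pool carries the storage read scope needed for the pull
gcloud container node-pools describe default-pool \
    --cluster CLUSTER-NAME --zone ZONE \
    --format="value(config.oauthScopes)"

# Once the installer pods are Running, run nvidia-smi in a throwaway pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    # GKE mounts the host's driver directory into the container
    # at /usr/local/nvidia (see the device-plugin mounts above)
    command: ["/usr/local/nvidia/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs gpu-smoke-test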