Tags: kubernetes, gpu, google-kubernetes-engine, nvidia, daemonset

GPUs not showing up on GKE node even though they show up in the GKE NodePool


I'm trying to set up a Google Kubernetes Engine cluster with GPUs in the nodes, loosely following these instructions, because I'm deploying programmatically with the Python client.
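
For context, the node pool I'm creating through the Python client is roughly equivalent to this gcloud command (the pool name, machine type, and accelerator type here are just examples, and CLUSTER-NAME/ZONE are placeholders):

# Roughly what my Python deployment does, expressed as gcloud
gcloud container node-pools create gpu-pool \
    --cluster CLUSTER-NAME \
    --zone ZONE \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --image-type COS_CONTAINERD \
    --scopes https://www.googleapis.com/auth/devstorage.read_only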

For some reason I can create a cluster with a NodePool that contains GPUs: [screenshot: GKE NodePool with GPUs]

...But the nodes in the NodePool don't have access to those GPUs: [screenshot: node without access to GPUs]

I've already installed the NVIDIA DaemonSet with this YAML file: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
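
For reference, this is how I applied it and how I'm checking the resulting pods (the label selector matches the k8s-app label visible in the pod description further down):

# Install the COS driver-installer daemonset
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Watch its pods come up in kube-system
kubectl get pods --namespace kube-system -l k8s-app=nvidia-driver-installer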

You can see that it's there in this screenshot of the kube-system pods: [screenshot: NVIDIA pods in kube-system]

For some reason those two pods always sit in status "ContainerCreating" and "PodInitializing"; they never flip green to "Running". How can I get the GPUs in the NodePool to become available on the node(s)?
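
In case it helps, this is how I'm checking whether a node actually exposes GPUs; nvidia.com/gpu is the extended resource name the device plugin is supposed to register, so I'd expect a non-empty count once everything works:

# Show the allocatable GPU count per node (<none> means no GPUs registered)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"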

Update:

Based on the comments, I ran kubectl describe pod POD_NAME --namespace kube-system on the two NVIDIA pods.

To do this I opened the kubectl terminal from the GCP Console UI. Then I ran the following commands:

gcloud container clusters get-credentials CLUSTER-NAME --zone ZONE --project PROJECT-NAME

Then, I called kubectl describe pod nvidia-gpu-device-plugin-UID --namespace kube-system and got this output:

Name:                 nvidia-gpu-device-plugin-UID
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time:           Wed, 02 Mar 2022 20:19:49 +0000
Labels:               controller-revision-hash=79765599fc
                      k8s-app=nvidia-gpu-device-plugin
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/nvidia-gpu-device-plugin
Containers:
  nvidia-gpu-device-plugin:
    Container ID:
    Image:         gcr.io/gke-release/nvidia-gpu-device-plugin@sha256:aa80c85c274a8e8f78110cae33cc92240d2f9b7efb3f53212f1cefd03de3c317
    Image ID:
    Port:          2112/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/nvidia-gpu-device-plugin
      -logtostderr
      --enable-container-gpu-metrics
      --enable-health-monitoring
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:     50m
      memory:  20Mi
    Environment:
      LD_LIBRARY_PATH:  /usr/local/nvidia/lib64
    Mounts:
      /dev from dev (rw)
      /device-plugin from device-plugin (rw)
      /etc/nvidia from nvidia-config (rw)
      /proc from proc (rw)
      /usr/local/nvidia from nvidia (rw)
      /var/lib/kubelet/pod-resources from pod-resources (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin/nvidia
    HostPathType:  Directory
  pod-resources:
    Type:          HostPath (bare host directory volume) 
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  nvidia-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/nvidia
    HostPathType:
  default-token-qnxjr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qnxjr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoExecute op=Exists
                 :NoSchedule op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    8m55s                 default-scheduler  Successfully assigned kube-system/nvidia-gpu-device-plugin-hxdwx to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
  Warning  FailedMount  6m42s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources proc]: timed out waiting for the condition
  Warning  FailedMount  4m25s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[proc nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources]: timed out waiting for the condition
  Warning  FailedMount  2m11s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[pod-resources proc nvidia-config default-token-qnxjr device-plugin dev nvidia]: timed out waiting for the condition
  Warning  FailedMount  31s (x12 over 8m45s)  kubelet            MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
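
My guess is that this mount fails because /home/kubernetes/bin/nvidia is only created once the driver installer has actually run on the node. A quick way to confirm, assuming SSH access to the node (ZONE is a placeholder):

# List the driver directory on the node itself; "No such file or directory"
# would mean the installer never completed
gcloud compute ssh gke-mycluster-clust-default-pool-26403abb-zqz6 \
    --zone ZONE --command "ls /home/kubernetes/bin/nvidia"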

Then, I called kubectl describe pod nvidia-driver-installer-UID --namespace kube-system and got this output:

Name:         nvidia-driver-installer-UID
Namespace:    kube-system
Priority:     0
Node:         gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time:   Wed, 02 Mar 2022 20:20:06 +0000
Labels:       controller-revision-hash=6bbfc44f6d
              k8s-app=nvidia-driver-installer
              name=nvidia-driver-installer
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           10.56.0.9
IPs:
  IP:           10.56.0.9
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:
    Image:          gke-nvidia-installer:fixed
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Containers:
  pause:
    Container ID:
    Image:          gcr.io/google-containers/pause:2.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  default-token-qnxjr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qnxjr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m20s                  default-scheduler  Successfully assigned kube-system/nvidia-driver-installer-tzw42 to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
  Normal   Pulling    2m36s (x4 over 4m19s)  kubelet            Pulling image "gke-nvidia-installer:fixed"
  Warning  Failed     2m34s (x4 over 4m10s)  kubelet            Failed to pull image "gke-nvidia-installer:fixed": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/gke-nvidia-installer:fixed": failed to resolve reference "docker.io/library/gke-nvidia-installer:fixed": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     2m34s (x4 over 4m10s)  kubelet            Error: ErrImagePull
  Warning  Failed     2m22s (x6 over 4m9s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m7s (x7 over 4m9s)    kubelet            Back-off pulling image "gke-nvidia-installer:fixed"
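
It looks like the pull fails because gke-nvidia-installer:fixed has no registry prefix, so containerd resolves it against docker.io/library/, where no such image exists. One way to check which image the installed daemonset actually references (the init container is the one shown in the output above):

# Print the image used by the daemonset's first init container
kubectl get daemonset nvidia-driver-installer --namespace kube-system \
    -o jsonpath='{.spec.template.spec.initContainers[0].image}'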

Solution

  • According to the Docker image that the container is trying to pull (gke-nvidia-installer:fixed), it looks like you're trying to use the Ubuntu daemonset instead of the COS one.

    You should run kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

    This will apply the right daemonset for your COS node pool, as stated here.

    In addition, please verify that your node pool has the https://www.googleapis.com/auth/devstorage.read_only scope, which is needed to pull the image. You should see it on your node pool's page in the GCP Console, under Security -> Access scopes (the relevant service is Storage). A way to check both from the command line is sketched below.
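
A sketch of how to verify the scope and then smoke-test the GPUs; the pool name, cluster, zone, and CUDA image tag are placeholders/examples, and the test pod assumes the drivers and device plugin are healthy:

# Confirm the node pool carries the storage read scope needed for the pull
gcloud container node-pools describe default-pool \
    --cluster CLUSTER-NAME --zone ZONE \
    --format="value(config.oauthScopes)"

# Once the installer pods are Running, run nvidia-smi in a throwaway pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    # GKE mounts the host's driver directory into the container
    # at /usr/local/nvidia (see the device-plugin mounts above)
    command: ["/usr/local/nvidia/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs gpu-smoke-test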