kubernetes · gpu · nvidia

GPU-enabled Kubernetes container is not scheduled


I have installed Nvidia's GPU operator and my GPU-enabled node has been automatically labelled (showing only the label I consider important; there is a long list of other labels as well):

nvidia.com/gpu.count=1

The node is seemingly schedulable:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2024 15:05:17 +0000   Tue, 10 Sep 2024 15:05:17 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletReady                 kubelet is posting ready status

The node also reports as Ready in "kubectl get nodes". However, when I look at the demo workload, I see:

`Warning  FailedScheduling  11s (x17 over 79m)  default-scheduler  0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`
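
For context, the demo workload looks roughly like this (a sketch; the image tag and names are assumptions), the relevant part being the nvidia.com/gpu resource limit:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1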

I have even tried to manually label the node with nvidia.com/gpu=1, with no luck so far. I have followed the guide from Nvidia: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html. The only deviation from the automatic deployment is that I installed the driver (v550) manually, as Nvidia hasn't published images for Ubuntu 24. nvidia-smi shows output, which is consistent with the node being labelled by the operator. Kubernetes is v1.31.0. Anything else I am missing?

I tried manually labelling the node and re-creating the pod; I expected to see the pod scheduled.
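
Presumably the "Insufficient nvidia.com/gpu" part of the event means the GPU is not being advertised as an allocatable extended resource by the device plugin; a node label alone does not create that resource. A check along these lines should confirm whether it is advertised (the node name is a placeholder):

kubectl describe node <gpu-node> | grep -A 8 Allocatable
# should list a line like "nvidia.com/gpu: 1"; if it is missing,
# the device plugin has not registered the GPU with the kubelet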


Solution

  • Well, it's embarrassing, but I somehow overlooked the failing nvidia-operator-validator pod. Would anybody believe "I bet it was running"? Anyway, looking at the pod's logs or description gives no useful information. But going onto the worker node where the container is scheduled (the one with the GPU) and running sudo crictl ps -a shows a driver-validation container with an ever-increasing failure counter. Its logs are actually useful and, besides (in my case) a successful nvidia-smi run, they give the answer:

    time="2024-09-12T15:33:34Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
    time="2024-09-12T15:33:34Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: ERROR: ../libkmod/libkmod-module.c:968 kmod_module_insert_module() could not find module by name='nvidia_current_updates'
    modprobe: ERROR: could not insert 'nvidia_current_updates': Unknown symbol in module, or unknown parameter (see dmesg)

    Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
    The existence of these symlinks is required to address the following bug:

        https://github.com/NVIDIA/gpu-operator/issues/430

    This bug impacts container runtimes configured with systemd cgroup management enabled.
    To disable the symlink creation, set the following envvar in ClusterPolicy:

        validator:
          driver:
            env:
            - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
              value: "true"
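
    For reference, digging that log out boils down to something like the following on the GPU worker node (the container ID is illustrative and will differ):

    sudo crictl ps -a | grep driver-validation   # the failing container, with a growing attempt count
    sudo crictl logs <container-id>              # the modprobe error above shows up here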
    

    I wasn't savvy enough to figure out where that ClusterPolicy setting would go, but reinstalling the gpu-operator with `helm install --wait gpu-operator-1 -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION --set-string validator.driver.env[0].value=true` saved the day.
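
    For anyone who would rather set that variable on the existing ClusterPolicy instead of reinstalling, a patch along these lines should achieve the same result (the resource is usually named cluster-policy, but check with kubectl get clusterpolicy first):

    kubectl get clusterpolicy
    kubectl patch clusterpolicy cluster-policy --type merge -p \
      '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'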

    Update 1. I was right: everything was working, until the GPU-enabled worker was rebooted. After the reboot the host no longer saw the Nvidia drivers, and even after reinstalling them, the feature-discovery, container-toolkit and device-plugin pods were stuck in a failure back-off state. The quick fix was to reinstall the GPU operator, but that is definitely not a proper solution.
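
    Until a proper fix turns up, a quick post-reboot checklist along these lines (node name is a placeholder; the namespace matches the helm install above) at least shows where things broke:

    nvidia-smi                                                 # is the driver loaded on the host?
    lsmod | grep nvidia                                        # are the kernel modules present?
    kubectl get pods -n gpu-operator                           # are the validator / toolkit / device-plugin pods healthy?
    kubectl describe node <gpu-node> | grep -A 8 Allocatable   # is nvidia.com/gpu advertised again?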