Search code examples
gpugoogle-kubernetes-engineprometheuskubernetes-podnvidia-docker

On GKE, dcgm-exporter pod fails to run if the nvidia.com/gpu resource is not allocated


I am trying to query GPU usage metrics of GKE pods.

Here is what I've done for test:

  1. Created GKE cluster with two node pools, one of them has two cpu-only nodes and the other has one node with NVIDIA Tesla T4 GPU. All nodes are running Container-Optimized OS.
  2. As written in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml.
  3. kubectl create -f dcgm-exporter.yaml
# dcgm-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        # resources:
        #   limits:
        #     nvidia.com/gpu: "1"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      tolerations:
        - effect: "NoExecute"
          operator: "Exists"
        - effect: "NoSchedule"
          operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
  1. The pod runs only on the gpu node but crashes with the following error:
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

With uncommenting the resources: limits: nvidia.com/gpu: "1", it successfully runs. However, I don't want this pod to occupy any GPU but just watch them.

How can I run the dcgm-exporter without allocating GPU to it? I tried with Ubuntu nodes but failed, too.


Solution

  • It worked with these:

    1. Set privileged: true to securityContext.
    2. Add volume mount "nvidia-install-dir-host".
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: "dcgm-exporter"
          app.kubernetes.io/version: "2.1.1"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: "dcgm-exporter"
            app.kubernetes.io/version: "2.1.1"
          name: "dcgm-exporter"
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists
          containers:
          - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
            env:
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
            name: "dcgm-exporter"
            ports:
            - name: "metrics"
              containerPort: 9400
            securityContext:
              privileged: true
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
            - name: "nvidia-install-dir-host"
              mountPath: "/usr/local/nvidia"
          tolerations:
            - effect: "NoExecute"
              operator: "Exists"
            - effect: "NoSchedule"
              operator: "Exists"
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
          - name: "nvidia-install-dir-host"
            hostPath:
              path: "/home/kubernetes/bin/nvidia"
    ---
    
    kind: Service
    apiVersion: v1
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9400'
    spec:
      selector:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      ports:
      - name: "metrics"
        port: 9400