Tags: kubernetes, prometheus, autoscaling, horizontal-pod-autoscaling, keda

Keda ScaledObject metric data differs from Prometheus


I am creating a Keda ScaledObject on a cloud GPU provider that exposes various metrics via a Prometheus instance, e.g.:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: [name]
  namespace: [namespace]
spec:
  cooldownPeriod: 30
  fallback:
    failureThreshold: 20
    replicas: 0
  maxReplicaCount: 4
  minReplicaCount: 1
  pollingInterval: 15
  scaleTargetRef:
    name: [deployment]
  triggers:
  - metadata:
      metricName: gpu-util
      metricType: Value
      query: |-
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      serverAddress: [address]:9090
      threshold: '80'
    type: prometheus

DCGM_FI_DEV_GPU_UTIL is an NVIDIA metric for GPU utilization. This creates a ScaledObject that appears to be functioning normally:

$ kubectl describe scaledobject [name] -n [namespace]
Name:         [name]
Namespace:    [namespace]
Labels:       scaledobject.keda.sh/name=[name]
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-04-28T01:27:50Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  36215438066
  UID:               [uid]
Spec:
  Cooldown Period:  30
  Fallback:
    Failure Threshold:  20
    Replicas:           0
  Max Replica Count:    4
  Min Replica Count:    1
  Polling Interval:     15
  Scale Target Ref:
    Name:  hashtop-1
  Triggers:
    Metadata:
      Metric Name:     gpu-util
      Namespace:       [namespace]
      Query:           avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      Server Address:  [url]:9090
      Threshold:       80
    Type:              prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
  External Metric Names:
    s0-prometheus-gpu-util
  Health:
    s0-prometheus-gpu-util:
      Number Of Failures:  0
      Status:              Happy
  Original Replica Count:  1
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type    Reason              Age                From           Message
  ----    ------              ----               ----           -------
  Normal  KEDAScalersStarted  78s                keda-operator  Started scalers watch
  Normal  ScaledObjectReady   63s (x2 over 78s)  keda-operator  ScaledObject is ready for scaling

When I run this query directly in Prometheus, I receive the results I expect:

# heavy utilization
$ curl '[url]/api/v1/query?query=avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL\[1m\]))' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1682728336,
          "99"
        ]
      }
    ]
  }
}

Here I can see the result is "99", i.e. 99% utilization. When the GPUs are idle it quickly drops to "0".

However, the HorizontalPodAutoscaler that Keda creates does not seem to match this data. For example, when I have one GPU idle and Prometheus returns "0" in the above query, the HPA looks like this:

# one GPU idle
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   19/80 (avg)   1         4         1          28m

The value hovers around 18-20 but never leaves that range. With four pegged GPUs the HPA reports very high numbers:

# Four GPUs 99% utilization
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   36117500m/80 (avg)   1         4         4          2m27s

As a result, my desired autoscaling behavior cannot be achieved. Since this runs on a managed cloud provider, I do not have direct access to the Keda operator itself.
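The closest I can get to debugging the operator is inspecting the HPA that Keda generates (by default named keda-hpa-<scaledobject-name>) and querying the external metrics API for the metric name listed in the ScaledObject status (s0-prometheus-gpu-util above). A rough sketch of those checks, assuming the cluster allows raw access to the external metrics API:

# inspect the HPA spec that Keda generated for this ScaledObject
$ kubectl get hpa keda-hpa-[name] -n [namespace] -o yaml

# ask the external metrics API for the value Keda serves to the HPA
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/[namespace]/s0-prometheus-gpu-util?labelSelector=scaledobject.keda.sh/name%3D[name]" | jq '.'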

What can I change in the ScaledObject definition to create an HPA that scales based on this GPU utilization metric from Prometheus?


Solution

  • This may be obvious to someone more experienced with Prometheus, but through trial and error I solved the issue by limiting the query to a pod-name regex:

    query: |-
      sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))

    I can only guess that the data returned from the Prometheus API was scoped differently from the data the Keda ScaledObject was receiving, causing the mismatch. The scoped query can be sanity-checked directly against Prometheus, as shown below.
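
    For example, a check against the same Prometheus endpoint as before, assuming pods named with the gpu- prefix (the pattern would need adjusting for a different deployment):

    # scoped query; this should track what the HPA now reports
    $ curl -sG '[url]/api/v1/query' --data-urlencode 'query=sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))' | jq '.data.result'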