Tags: kubernetes, prometheus, autoscaling, horizontal-pod-autoscaling, keda

Keda ScaledObject metric data differs from Prometheus


I am creating a Keda ScaledObject on a cloud GPU provider that exposes various metrics via a Prometheus instance, e.g.:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: [name]
  namespace: [namespace]
spec:
  cooldownPeriod: 30
  fallback:
    failureThreshold: 20
    replicas: 0
  maxReplicaCount: 4
  minReplicaCount: 1
  pollingInterval: 15
  scaleTargetRef:
    name: [deployment]
  triggers:
  - metadata:
      metricName: gpu-util
      metricType: Value
      query: |-
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      serverAddress: [address]:9090
      threshold: '80'
    type: prometheus

DCGM_FI_DEV_GPU_UTIL is an NVIDIA metric for GPU utilization. This creates a ScaledObject that appears to be functioning normally:

$ kubectl describe scaledobject [name] -n [namespace]
Name:         [name]
Namespace:    [namespace]
Labels:       scaledobject.keda.sh/name=[name]
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-04-28T01:27:50Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  36215438066
  UID:               [uid]
Spec:
  Cooldown Period:  30
  Fallback:
    Failure Threshold:  20
    Replicas:           0
  Max Replica Count:    4
  Min Replica Count:    1
  Polling Interval:     15
  Scale Target Ref:
    Name:  hashtop-1
  Triggers:
    Metadata:
      Metric Name:     gpu-util
      Namespace:       [namespace]
      Query:           avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      Server Address:  [url]:9090
      Threshold:       80
    Type:              prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
  External Metric Names:
    s0-prometheus-gpu-util
  Health:
    s0-prometheus-gpu-util:
      Number Of Failures:  0
      Status:              Happy
  Original Replica Count:  1
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type    Reason              Age                From           Message
  ----    ------              ----               ----           -------
  Normal  KEDAScalersStarted  78s                keda-operator  Started scalers watch
  Normal  ScaledObjectReady   63s (x2 over 78s)  keda-operator  ScaledObject is ready for scaling

When I run this query directly in Prometheus, I receive the results I expect:

# heavy utilization
$ curl '[url]/api/v1/query?query=avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL\[1m\]))' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1682728336,
          "99"
        ]
      }
    ]
  }
}

Here I can see the result is "99", i.e. 99% utilization. When the GPUs are idle it quickly drops to "0".

However, the HorizontalPodAutoscaler that Keda creates does not seem to match this data. For example, when I have one GPU idle and Prometheus returns "0" in the above query, the HPA looks like this:

# one GPU idle
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   19/80 (avg)   1         4         1          28m

The value hovers around 18-20 but never leaves that range. With four pegged GPUs the HPA reports very high numbers:

# Four GPUs 99% utilization
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   36117500m/80 (avg)   1         4         4          2m27s

As a result, my desired autoscaling behavior cannot be achieved. Since this runs on a managed cloud provider, I do not have direct access to the Keda operator itself.
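The closest I can get to debugging the operator is inspecting the HPA that Keda generates (by default named keda-hpa-<scaledobject-name>) and querying the external metrics API for the metric name listed in the ScaledObject status (s0-prometheus-gpu-util above). A rough sketch of those checks, assuming the cluster allows raw access to the external metrics API:

# inspect the HPA spec that Keda generated for this ScaledObject
$ kubectl get hpa keda-hpa-[name] -n [namespace] -o yaml

# ask the external metrics API for the value Keda serves to the HPA
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/[namespace]/s0-prometheus-gpu-util?labelSelector=scaledobject.keda.sh/name%3D[name]" | jq '.'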

What can I change in the ScaledObject definition to create an HPA that scales based on this GPU utilization metric from Prometheus?


Solution

  • This may be obvious to someone more experienced with Prometheus, but through trial and error I solved the issue by limiting the query to a pod-name regex:

    query: |-
      sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))

    I can only guess that the data returned from the Prometheus API was scoped differently from the data the Keda ScaledObject was receiving, causing the mismatch. The scoped query can be sanity-checked directly against Prometheus, as shown below.
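
    For example, a check against the same Prometheus endpoint as before, assuming pods named with the gpu- prefix (the pattern would need adjusting for a different deployment):

    # scoped query; this should track what the HPA now reports
    $ curl -sG '[url]/api/v1/query' --data-urlencode 'query=sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))' | jq '.data.result'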