I am creating a KEDA ScaledObject on a cloud GPU provider that exposes various metrics via a Prometheus instance, e.g.:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: [name]
  namespace: [namespace]
spec:
  cooldownPeriod: 30
  fallback:
    failureThreshold: 20
    replicas: 0
  maxReplicaCount: 4
  minReplicaCount: 1
  pollingInterval: 15
  scaleTargetRef:
    name: [deployment]
  triggers:
    - metadata:
        metricName: gpu-util
        metricType: Value
        query: |-
          avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
        serverAddress: [address]:9090
        threshold: '80'
      type: prometheus
DCGM_FI_DEV_GPU_UTIL is an NVIDIA metric for GPU utilization. The manifest creates a ScaledObject that appears to be functioning normally:
$ kubectl describe scaledobject [name] -n [namespace]
Name:         [name]
Namespace:    [namespace]
Labels:       scaledobject.keda.sh/name=[name]
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-04-28T01:27:50Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  36215438066
  UID:               [uid]
Spec:
  Cooldown Period:  30
  Fallback:
    Failure Threshold:  20
    Replicas:           0
  Max Replica Count:  4
  Min Replica Count:  1
  Polling Interval:   15
  Scale Target Ref:
    Name:  hashtop-1
  Triggers:
    Metadata:
      Metric Name:     gpu-util
      Namespace:       [namespace]
      Query:           avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      Server Address:  [url]:9090
      Threshold:       80
    Type:              prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
  External Metric Names:
    s0-prometheus-gpu-util
  Health:
    s0-prometheus-gpu-util:
      Number Of Failures:  0
      Status:              Happy
  Original Replica Count:  1
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type    Reason              Age                From           Message
  ----    ------              ----               ----           -------
  Normal  KEDAScalersStarted  78s                keda-operator  Started scalers watch
  Normal  ScaledObjectReady   63s (x2 over 78s)  keda-operator  ScaledObject is ready for scaling
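As a cross-check, the value KEDA's metrics adapter serves to the HPA can be read back through the external metrics API. A sketch, using the external metric name s0-prometheus-gpu-util from the status above and the documented scaledobject.keda.sh/name label selector (the exact path may vary with the KEDA release):

# what KEDA is serving to the HPA for this ScaledObject
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/[namespace]/s0-prometheus-gpu-util?labelSelector=scaledobject.keda.sh/name%3D[name]" | jq '.'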
When I run this query directly in Prometheus, I receive the results I expect:
# heavy utilization
$ curl '[url]/api/v1/query?query=avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL\[1m\]))' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1682728336,
          "99"
        ]
      }
    ]
  }
}
Here I can see the result is "99" % utilization. When the GPUs are idle it quickly goes to "0".
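Dropping the aggregation shows which label sets the individual series carry, i.e. what the avg() is actually averaging over. A quick sketch; the exact labels (gpu, Hostname, pod, etc.) depend on the dcgm-exporter deployment:

# list the label sets of the raw, unaggregated series
$ curl -s '[url]/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result[].metric'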
However, the HorizontalPodAutoscaler that KEDA creates does not seem to match this data. For example, when I have one GPU idle and Prometheus returns "0" for the above query, the HPA looks like this:
# one GPU idle
$ kubectl get hpa
NAME     REFERENCE                  TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]    19/80 (avg)   1         4         1          28m
The value hovers around 18-20 but never leaves that range. With four pegged GPUs the HPA reports very high numbers:
# Four GPUs 99% utilization
$ kubectl get hpa
NAME     REFERENCE                  TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]    36117500m/80 (avg)   1         4         4          2m27s
(The TARGETS column uses Kubernetes quantity notation: 36117500m is 36117.5, so the HPA is seeing values around 36,000 rather than the 99 Prometheus reports.) As such, my desired autoscaling behavior cannot be achieved. Since this runs at a cloud provider, I do not have direct access to the KEDA operator itself.
What can I change in the ScaledObject definition to create an HPA that scales based on this GPU utilization metric from Prometheus?
This may be obvious to someone more experienced with Prometheus, but through trial and error I solved the issue by limiting the query to a pod regex:
query: |-
  sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))
I can only guess that the data returned from the Prometheus API was scoped differently from the data the KEDA ScaledObject was receiving, causing the mismatch.
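For what it's worth, the scoped query can be sanity-checked the same way as the original one. A sketch using curl's -G/--data-urlencode so the braces and brackets survive URL encoding:

# verify the pod-scoped query against the Prometheus API
$ curl -sG '[url]/api/v1/query' --data-urlencode 'query=sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))' | jq '.'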