I currently have my HPA configured like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 133 # Scale up when CPU usage exceeds 133% of resources.requests. Must be less than resources.limits.
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 133
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 10
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120 # Scale down by 1 pod per 2 minutes
I'm new to HPAs. From what I've seen on the internet, most people seem to set averageUtilization to something like 50-80%, and I'm trying to understand why.

As I understand it, averageUtilization is a percentage of the deployment's resources.requests, and requests is "the minimum amount of compute resources required". Why would I want to scale up when my utilization exceeds 50% of the minimum required?
I've configured resources.requests a tiny bit above my normal usage, so there's no need to scale up when the app is operating under normal conditions. Normal utilization should be something like 90%, am I wrong?

So I've set averageUtilization to 133%; that way, if utilization is high, the HPA should start adding more pods.
I'll start by saying: if your combination of resource requests, resource limits, node sizes, and HPA settings has the effect you want, then it's correct. Are you scaling up and down when you want to, and not getting evicted more than you can tolerate? There's nothing "wrong" per se with the configuration you're showing here.
I think the one big thing I'd change in your wording is to say that resource requests are the minimum the cluster is guaranteed to allocate for you. This is not necessarily the same as the minimum the application needs.
Say the application needs 512 MiB of memory to start up, usually runs at 1 GiB in the steady state, but can burst up to 2 GiB under load. I'd probably set the memory requests to 1 GiB (the steady-state value) and the limits to 2 GiB (the peak value). The risk is that the further apart these two values are, the more likely it is that your node will run out of memory under maximum load; there's an argument for setting the two values equal, so the memory is guaranteed to be available and the pod never gets evicted.
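As a sketch of that split, assuming the app-deployment your HPA already targets (the label, container name, and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  selector:
    matchLabels:
      app: app-deployment          # hypothetical label; match whatever your Deployment already uses
  template:
    metadata:
      labels:
        app: app-deployment
    spec:
      containers:
      - name: app                  # placeholder container name
        image: example/app:1.0     # placeholder image
        resources:
          requests:
            memory: 1Gi            # steady-state usage
          limits:
            memory: 2Gi            # burst ceiling; set equal to requests if you'd rather never risk eviction

Setting limits equal to requests (for every resource on every container) is what gives the pod the Guaranteed QoS class, which is the never-get-evicted end of that trade-off.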
If you've set the resource requests to the steady-state expected utilization, then you'll probably want to set the HPA to target about 100% of the requests; if actual usage is significantly higher (or lower) then it's time to scale up (or down). If you've set resource requests to guaranteed allocation (requests == limits) then that's probably where you're seeing that 50-80% target.
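With your setup (requests a bit above normal usage), that would look roughly like this in the spec.metrics block of the HPA you already have; the exact number is a judgment call, not a magic value:

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 100   # ~steady state; sustained usage above requests means real extra load
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 100   # same idea, but see the memory caveat below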
There are also challenges with both of the built-in resource metrics, depending on your language runtime.
For memory autoscaling, are you using a garbage-collected language (pretty much everything other than C/C++/Rust), and if so, does it ever give memory back to the OS? You might find yourself never being able to scale down.
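If that turns out to be the case for your runtime, one option worth considering (my illustration, not a requirement) is to drop the memory metric and let CPU alone drive the scaling:

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 100   # memory metric removed; a GC'd runtime that never returns memory would otherwise block scale-down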
For CPU autoscaling, how much of your application's time goes to a database or some other dependency, leaving it mostly in I/O wait? CPU utilization can stay low in that situation, but sometimes you need to scale up anyway, especially if you have thread-pool constraints.
I've had the best results attaching the HPA to other metrics, like thread-pool utilization or queue length, but these require some substantial administrator setup (a custom- or external-metrics adapter) to make them accessible.
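For illustration, here's roughly what a queue-length-based HPA looks like once an external-metrics adapter (prometheus-adapter, KEDA, or similar) is in place; the metric name and target value are hypothetical and depend entirely on what your adapter exposes:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: worker_queue_depth    # hypothetical metric name exposed by your metrics adapter
      target:
        type: AverageValue
        averageValue: "30"          # aim for ~30 queued items per pod; tune to your workload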