I'm relatively new to AWS, and am trying to understand how (if it's possible) to configure a model behind an async endpoint to scale from 0 instances when there are no requests to 1 or more instances when a request arrives, then shut down again when idle. Ideally, if a second request arrives, it would scale up another instance, and so on up to some maximum of n.
I'm trying to compare async inference with batch inference; I have large jobs that need processing periodically, on demand. I've not yet decided which to go with. I'm fairly confident I have a good understanding of batch inference, and I have async inference working, but am struggling with auto-scaling.
I think I have the process down of using the API to `register-scalable-target` and `put-scaling-policy`, and when I use `application-autoscaling describe-scaling-activities` I can sometimes see it scaling in and out, but I cannot get it to do what I want.
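For anyone following along, the registration step can be sketched with boto3 (the endpoint and variant names here are placeholders; `MinCapacity=0` is what permits scaling to zero on async endpoints):

```python
def scalable_target_kwargs(endpoint_name, variant_name="AllTraffic", max_instances=4):
    """Build the register_scalable_target request for an async endpoint.

    MinCapacity=0 is what allows the endpoint to scale all the way down
    to zero instances when the backlog is empty.
    """
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,
        "MaxCapacity": max_instances,
    }

# import boto3
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**scalable_target_kwargs("my-async-endpoint"))
```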
I've played with a few options, but currently I have:
```python
TargetTrackingScalingPolicyConfiguration={
    "TargetValue": 1.0,
    "CustomizedMetricSpecification": {
        "MetricName": "HasBacklogWithoutCapacity",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Statistic": "Maximum",
    },
    "ScaleInCooldown": 0,
    "ScaleOutCooldown": 30,
}
```
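For completeness, that configuration is what gets passed to `put-scaling-policy`; a minimal sketch of the surrounding call, with a placeholder policy name:

```python
def put_policy_kwargs(endpoint_name, config, variant_name="AllTraffic"):
    """Wrap a target-tracking configuration in a full put_scaling_policy request."""
    return {
        "PolicyName": "async-backlog-target-tracking",  # placeholder name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": config,
    }

# import boto3
# client = boto3.client("application-autoscaling")
# client.put_scaling_policy(**put_policy_kwargs("my-async-endpoint", config))
```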
Edit: I've now learnt that you can monitor this in CloudWatch under Alarms, and have updated the question accordingly. Having found that, I have a better understanding, but the time it takes for the alarms to trigger is excessively long for what I'm trying to achieve.
There are two alarms, for scale in and scale out respectively:

- `HasBacklogWithoutCapacity` < 0.45 for 15 datapoints within 15 minutes
- `HasBacklogWithoutCapacity` > 0.5 for 3 datapoints within 3 minutes
So how does it determine 15 minutes for the first alarm and 3 minutes for the second (and, more importantly, how do I reduce these)? I don't really mind it taking a while to scale from 1 to 0, but I'd like it to scale from 0 to 1 as quickly as possible.
From the docs, "A target tracking scaling policy is more aggressive in adding capacity when utilization increases than it is in removing capacity when utilization decreases.", which would explain the behavior of how fast it scales up vs down.
Since you want to scale instances proportionally with the number of requests, you should look into step scaling policies, which allow you to "choose scaling metrics and threshold values for the CloudWatch alarms that trigger the scaling process as well as define how your scalable target should be scaled when a threshold is in breach for a specified number of evaluation periods".
In this case, you want to scale from 0 to 1 as quickly as possible, so you would minimize the number of evaluation periods (weigh whether outliers triggering a scale-out is a concern, based on the resources at hand and the likelihood/frequency of outliers).
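A rough sketch of that setup with boto3 (the policy name, alarm name, and one-minute period are illustrative; the metric and dimension come from your existing config):

```python
def step_policy_kwargs(endpoint_name, variant_name="AllTraffic"):
    """Step scaling policy: add one instance each time the alarm fires."""
    return {
        "PolicyName": "scale-out-on-backlog",  # illustrative name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            # Any breach above the alarm threshold adds one instance.
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1},
            ],
        },
    }

def alarm_kwargs(endpoint_name, policy_arn):
    """One datapoint in one minute, so scale-out starts quickly."""
    return {
        "AlarmName": "has-backlog-without-capacity",  # illustrative name
        "MetricName": "HasBacklogWithoutCapacity",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.5,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [policy_arn],
    }

# import boto3
# aas = boto3.client("application-autoscaling")
# arn = aas.put_scaling_policy(**step_policy_kwargs("my-async-endpoint"))["PolicyARN"]
# boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs("my-async-endpoint", arn))
```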