amazon-web-services, monitoring, amazon-cloudwatch

Delay in AWS CloudWatch Alarm state change


I have an alarm tracking the metric for LoadBalancer 5xx errors on a single ALB. It should be in an "In alarm" state if 1 out of the last 1 datapoints is above the threshold of 2. The period is set to 1 minute. See the alarm details:

[image: alarm details]

On 2020-09-23 at 17:18 UTC the Load Balancer started to return 502 errors. This is shown in the CloudWatch metric chart below, and I've confirmed the times are correct (this was a forced 502 response, so I know when I triggered it, and I can see the 17:18 timestamp in the ALB logs).

[image: CloudWatch metric chart]

But in the alarm log, the "In Alarm" state was only triggered at 17:22 UTC - 4 minutes after the 17:18 period had more than 2 errors. This isn't a delay in receiving a notification - it's about a delay in the state change compared to my expectation. Notifications were correctly received within seconds of the state change.

Here is the alarm log with the state change timestamps:

[image: alarm log]

We treat missing data as good (not breaching), so based on the metric graph I expected the alarm to recover to OK at 17:22 (after the 17:21 period with 0 errors), but it only returned to OK at 17:27, a 5-minute delay.

I then expected it to return to "In alarm" at 17:24, but that didn't happen until 17:28.

Finally, I expect it to have returned to OK at 17:31 but it took until 17:40 - a full 9 minutes afterwards.

Why is there a 4-9 minute delay between when I expect a state transition and it actually happening?


Solution

  • I think the explanation is given in the following AWS forum thread:

    Unexplainable delay between Alarm data breach and Alarm state change

    Basically, alarms are evaluated over a longer window than the period you set, not just the last 1 minute. That window is the evaluation range, and you as a user have no direct control over it.

    From the forum:

    The reporting criteria for the HTTPCode_Target_4XX_Count metric is if there is a non-zero value. That means data point will only be reported if a non-zero value is generated, otherwise nothing will be pushed to the metric.

    CloudWatch standard alarm evaluates its state every minute and no matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods (1 in this case). The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range. Treat missing data as setting is applied if all the data in the evaluation range is missing, and not just if the data in evaluation period is missing.

    Hence, CloudWatch alarms will look at some previous data points to evaluate its state, and will use the treat missing data as setting if all the data in evaluation range is missing. In this case, for the time when alarm did not transition to OK state, it was using the previous data points in the evaluation range to evaluate its state, as expected.

    The alarm evaluation in case of missing data is explained in detail here, that will help in understanding this further: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-evaluating-missing-data
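
To make the evaluation-range behaviour quoted above more concrete, here is a minimal Python sketch of a simplified model. It is not CloudWatch's actual algorithm. The function name `simulate_alarm_states` is hypothetical, and the `evaluation_range` of 5 datapoints is an assumed, illustrative value (the real number depends on the alarm period and metric resolution and is not user-configurable). The model covers only a "1 out of 1" alarm with missing data treated as not breaching, and assumes the alarm is re-evaluated once per minute using the most recent datapoint that exists inside the range.

```python
from typing import List, Optional, Sequence


def simulate_alarm_states(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_range: int = 5,  # assumed value, for illustration only
) -> List[str]:
    """Return the alarm state after each one-minute evaluation.

    `datapoints` holds one entry per minute; None means the metric
    reported nothing for that minute (missing data).
    """
    states = []
    for minute in range(len(datapoints)):
        # Look back over the whole evaluation range, not just one period.
        start = max(0, minute - evaluation_range + 1)
        window = datapoints[start:minute + 1]
        present = [value for value in window if value is not None]

        if not present:
            # Every datapoint in the range is missing, so the
            # "treat missing data as" setting applies (not breaching -> OK).
            states.append("OK")
        elif present[-1] > threshold:
            # The most recent real datapoint in the range breaches.
            states.append("ALARM")
        else:
            states.append("OK")
    return states


# A single breaching minute surrounded by missing data.
metric = [None, 5, None, None, None, None, None, None]
print(simulate_alarm_states(metric, threshold=2))
```

In this model the output is `['OK', 'ALARM', 'ALARM', 'ALARM', 'ALARM', 'ALARM', 'OK', 'OK']`: the alarm only returns to OK once the breaching datapoint has aged out of the evaluation range, several minutes after the last error. That matches the kind of delayed recovery described in the question, although the exact delays in a real alarm depend on the range CloudWatch actually uses.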