
AWS CloudWatch Metric Alarm not triggering after the first time


I have an alarm that watches for an error message in the logs, and it does go into the alarm state. But it never resets and stays in the In Alarm state. The alarm action is an SNS topic, which in turn sends an email, so after the first error I don't see any subsequent emails. What's wrong with the following template config?

"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}

Solution

  • Your problem is a combination of two factors:

    1. Your metric is only emitted when an error is found. It is a sparse metric: a 1 is recorded on each error, but no 0 is emitted when there is no error.
    2. By default, CloudWatch alarms are configured with TreatMissingData set to missing.

    CloudWatch documentation about missing data says:

    For each alarm, you can specify CloudWatch to treat missing data points as any of the following:

    • notBreaching – Missing data points are treated as "good" and within the threshold
    • breaching – Missing data points are treated as "bad" and breaching the threshold
    • ignore – The current alarm state is maintained
    • missing – The alarm doesn't consider missing data points when evaluating whether to change state

    Adding the "TreatMissingData": "notBreaching" property to your alarm configuration makes CloudWatch treat missing data points as not breaching, so the alarm transitions back to OK once no new errors are logged:

    "AppErrorAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "ActionsEnabled": "true",
                "AlarmName": {
                    "Fn::Join": [
                        "",
                        [
                            {
                                "Ref": "AppId"
                            },
                            ",",
                            {
                                "Ref": "AppServerAG"
                            },
                            ":",
                            "AppError",
                            ",",
                            "MINOR"
                        ]
                    ]
                },
                "AlarmDescription": {
                    "Fn::Join": [
                        "",
                        [
                            "service is throwing error. Please check logs.",
                            {
                                "Ref": "AppServerAG"
                            },
                            "-",
                            {
                                "Ref": "AppId"
                            }
                        ]
                    ]
                },
                "MetricName": "AppError",
                "Namespace": {
                    "Fn::Join": [
                        "",
                        [
                            {
                                "Ref": "ApplicationEndpoint"
                            },
                            "metrics/AppError"
                        ]
                    ]
                },
                "Statistic": "Sum",
                "Period": "300",
                "EvaluationPeriods": "1",
                "Threshold": "1",
                "TreatMissingData": "notBreaching",
                "AlarmActions": [{
                  "Fn::GetAtt": [
                    "VPCInfo",
                    "SNSTopic"
                  ]
                }],
                "ComparisonOperator": "GreaterThanOrEqualToThreshold"
            }
    }
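
    As a complementary sketch for the first factor, a metric filter can also be given a DefaultValue, so that a 0 is published when log events arrive but none of them match the pattern. Treat this as an assumption about what might suit your setup rather than part of the fix above: the metric is still missing in periods where the log group receives no events at all, so setting TreatMissingData on the alarm remains worthwhile. The outer Resources wrapper is only there to make the fragment valid JSON; the resource names and the namespace are copied from the metric filter in your template.

    ```json
    {
      "Resources": {
        "AppErrorMetric": {
          "Type": "AWS::Logs::MetricFilter",
          "Properties": {
            "LogGroupName": {
              "Ref": "AppServerLG"
            },
            "FilterPattern": "[error]",
            "MetricTransformations": [
              {
                "MetricValue": "1",
                "DefaultValue": 0,
                "MetricNamespace": {
                  "Fn::Join": [
                    "",
                    [
                      {
                        "Ref": "ApplicationEndpoint"
                      },
                      "/metrics/AppError"
                    ]
                  ]
                },
                "MetricName": "AppError"
              }
            ]
          }
        }
      }
    }
    ```

    With both changes in place, a 300-second period that contains log traffic but no errors would sum to 0, and a period with no log traffic at all would be treated as not breaching, so in either case the alarm can return to OK and fire again on the next error.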