I have an alarm that looks for an error message in the logs, and it does go into the alarm state. But it never resets and remains in the In Alarm state. The alarm action is an SNS topic, which in turn triggers an email. So after the first error I don't see any subsequent emails. What's wrong with the following template config?
"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "ActionsEnabled": "true",
    "AlarmName": {
      "Fn::Join": [
        "",
        [
          {
            "Ref": "AppId"
          },
          ",",
          {
            "Ref": "AppServerAG"
          },
          ":",
          "AppError",
          ",",
          "MINOR"
        ]
      ]
    },
    "AlarmDescription": {
      "Fn::Join": [
        "",
        [
          "service is throwing error. Please check logs.",
          {
            "Ref": "AppServerAG"
          },
          "-",
          {
            "Ref": "AppId"
          }
        ]
      ]
    },
    "MetricName": "AppError",
    "Namespace": {
      "Fn::Join": [
        "",
        [
          {
            "Ref": "ApplicationEndpoint"
          },
          "metrics/AppError"
        ]
      ]
    },
    "Statistic": "Sum",
    "Period": "300",
    "EvaluationPeriods": "1",
    "Threshold": "1",
    "AlarmActions": [{
      "Fn::GetAtt": [
        "VPCInfo",
        "SNSTopic"
      ]
    }],
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
}
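Your problem is a combination of two factors:
- The metric filter only emits a data point when a log event matches the pattern, so periods without errors produce no data at all.
- The alarm uses the default TreatMissingData setting, which is missing.
The CloudWatch documentation about missing data says: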
For each alarm, you can specify CloudWatch to treat missing data points as any of the following:
- notBreaching – Missing data points are treated as "good" and within the threshold
- breaching – Missing data points are treated as "bad" and breaching the threshold
- ignore – The current alarm state is maintained
- missing – The alarm doesn't consider missing data points when evaluating whether to change state
Adding the "TreatMissingData": "notBreaching" property to your alarm configuration makes CloudWatch treat missing data points as not breaching, so the alarm transitions back to OK once an evaluation period passes without errors:
"AppErrorAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "ActionsEnabled": "true",
    "AlarmName": {
      "Fn::Join": [
        "",
        [
          {
            "Ref": "AppId"
          },
          ",",
          {
            "Ref": "AppServerAG"
          },
          ":",
          "AppError",
          ",",
          "MINOR"
        ]
      ]
    },
    "AlarmDescription": {
      "Fn::Join": [
        "",
        [
          "service is throwing error. Please check logs.",
          {
            "Ref": "AppServerAG"
          },
          "-",
          {
            "Ref": "AppId"
          }
        ]
      ]
    },
    "MetricName": "AppError",
    "Namespace": {
      "Fn::Join": [
        "",
        [
          {
            "Ref": "ApplicationEndpoint"
          },
          "metrics/AppError"
        ]
      ]
    },
    "Statistic": "Sum",
    "Period": "300",
    "EvaluationPeriods": "1",
    "Threshold": "1",
    "TreatMissingData": "notBreaching",
    "AlarmActions": [{
      "Fn::GetAtt": [
        "VPCInfo",
        "SNSTopic"
      ]
    }],
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
}
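As a possible complement, the first factor (no data points when nothing matches) can be addressed on the metric filter side. The sketch below assumes you add the DefaultValue property to the metric transformation, so that CloudWatch Logs reports 0 for periods in which log events were ingested but none matched the pattern. The resource names are the ones from your template, and the fragment is wrapped in braces only so that it is valid JSON on its own. Treat it as a sketch rather than a drop-in fix: as far as I know, no default value is published while the log group receives no events at all, so TreatMissingData on the alarm is still what covers fully silent periods.

```json
{
  "AppErrorMetric": {
    "Type": "AWS::Logs::MetricFilter",
    "Properties": {
      "LogGroupName": {
        "Ref": "AppServerLG"
      },
      "FilterPattern": "[error]",
      "MetricTransformations": [
        {
          "MetricValue": "1",
          "DefaultValue": 0,
          "MetricNamespace": {
            "Fn::Join": [
              "",
              [
                {
                  "Ref": "ApplicationEndpoint"
                },
                "/metrics/AppError"
              ]
            ]
          },
          "MetricName": "AppError"
        }
      ]
    }
  }
}
```

With this in place, a 300-second period that contains log traffic but no errors should have a Sum of 0, which is below the threshold of 1, so the alarm can return to OK from real data rather than from the missing-data rule.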