amazon-web-services amazon-cloudwatchlogs amazon-cloudwatch-metrics

AWS CloudWatch Logs metrics for handled and unhandled exceptions

I have an interesting scenario with AWS CloudWatch Logs. I currently use log4net and pump all of the logs into CloudWatch Logs using the CloudWatch Logs agent. I have a metric in CloudWatch that scans for [ERROR] entries, and an Alarm passes them on to another service for dev notifications as they occur (Threshold >= 1, Period = 1 min). All of this is working great.

Now I want to handle certain errors differently. For instance, based on the exception type, I want to trigger the Alarm only when X occurrences have happened during an N-minute period. So in this case I'd create a metric for this condition and then assign an Alarm to it. The problem is that the general error metric, explained in the first part of this question, is still tracking each individual error occurrence. So now I'm getting multiple notifications: one for each error and one after X occurrences.

I can disable the general error metric, but then I lose the ability to track unhandled exceptions. I'd have to have a metric for each and every possible exception. Am I missing something? What's the best way to handle this?

Solution

You can generally handle this by creating a function that does some additional processing before you are notified. The easiest way to do this is to subscribe an AWS Lambda function to your unhandled error alarm's SNS topic. Unsubscribe yourself from the topic, and have the Lambda function notify you in place of SNS, only once the conditions you define have been met.
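As a rough sketch of the Lambda side, assuming the function is subscribed to the alarm's SNS topic and receives the usual CloudWatch alarm notification as a JSON string in `Sns.Message`. The helper name `parse_alarm_records` and the keys of the returned dictionaries are made up for illustration, and the `StateChangeTime` format shown is an assumption you should check against a real notification:

```python
import json
from datetime import datetime


def parse_alarm_records(event):
    """Pull the alarm details out of an SNS-triggered Lambda event."""
    alarms = []
    for record in event.get("Records", []):
        # CloudWatch delivers the alarm notification as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        # Assumed format: 2017-01-12T16:30:42.236+0000
        changed = datetime.strptime(
            message["StateChangeTime"], "%Y-%m-%dT%H:%M:%S.%f%z"
        )
        # The CloudWatch Logs APIs expect epoch milliseconds.
        timestamp_ms = int(changed.timestamp()) * 1000 + changed.microsecond // 1000
        alarms.append(
            {
                "name": message["AlarmName"],
                "state": message["NewStateValue"],
                "timestamp_ms": timestamp_ms,
                "period_s": message["Trigger"]["Period"],
            }
        )
    return alarms
```

The timestamp and period extracted here are what the later log queries need to build their time window.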

For this situation, it sounds like you would want to suppress notifications from your individual metric for unhandled errors matching your aggregate metric while your aggregate metric is in the alarm state.

Pseudocode:

1. Use the DescribeAlarms API to get the state of your aggregate unhandled exception alarm. If the aggregate alarm is in the ALARM state, continue.
2. Use the FilterLogEvents API to get the log events matching:
- Your Log Group
- Your Log Stream
- FilterPattern: Your individual unhandled exception alarm's metric filter
- StartTime: alarm timestamp - period
- EndTime: alarm timestamp
3. Use the GetLogEvents API to get all log events matching:
- Your Log Group
- Your Log Stream
- StartTime: alarm timestamp - period
- EndTime: alarm timestamp
4. If the 'all events' count and the 'filtered events' count match, and the aggregate alarm is in the ALARM state, do not send a notification. Otherwise, use the SES or SNS APIs to send yourself a notification.

Note that both log APIs take StartTime and EndTime in milliseconds since the epoch, while the alarm period is given in seconds, so convert before subtracting.
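The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and `cloudwatch` and `logs` are assumed to be boto3 clients (for example `boto3.client("cloudwatch")` and `boto3.client("logs")`) that the caller passes in. The filter pattern is left as a parameter, since which metric filter to use depends on your setup.

```python
def count_filtered_events(logs, group, stream, pattern, start_ms, end_ms):
    """Count events in the window that match the filter pattern."""
    kwargs = dict(
        logGroupName=group,
        logStreamNames=[stream],
        filterPattern=pattern,
        startTime=start_ms,
        endTime=end_ms,
    )
    count = 0
    while True:
        page = logs.filter_log_events(**kwargs)
        count += len(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return count
        kwargs["nextToken"] = token


def count_all_events(logs, group, stream, start_ms, end_ms):
    """Count every event in the window, whatever its content."""
    kwargs = dict(
        logGroupName=group,
        logStreamName=stream,
        startTime=start_ms,
        endTime=end_ms,
        startFromHead=True,
    )
    count, token = 0, None
    while True:
        page = logs.get_log_events(**kwargs)
        count += len(page.get("events", []))
        next_token = page.get("nextForwardToken")
        # GetLogEvents signals the end by returning the token it was given.
        if not next_token or next_token == token:
            return count
        token = next_token
        kwargs["nextToken"] = token


def should_notify(cloudwatch, logs, *, aggregate_alarm, group, stream,
                  pattern, alarm_ms, period_s):
    """Return False only when the aggregate alarm already covers every event."""
    resp = cloudwatch.describe_alarms(AlarmNames=[aggregate_alarm])
    alarms = resp.get("MetricAlarms", [])
    if not alarms or alarms[0]["StateValue"] != "ALARM":
        return True  # aggregate alarm is quiet, so notify as usual
    start_ms = alarm_ms - period_s * 1000  # period is in seconds
    filtered = count_filtered_events(logs, group, stream, pattern, start_ms, alarm_ms)
    total = count_all_events(logs, group, stream, start_ms, alarm_ms)
    return filtered != total
```

Passing the clients in as arguments keeps the decision logic easy to exercise with stand-in objects before wiring it to real AWS resources.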

If you want to continue being notified via SNS, don't reuse the same topic that the alarm uses to trigger Lambda; create a separate one for your mobile/SMS notifications.
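Publishing to that separate topic from the Lambda function might look like the sketch below. It assumes `sns` is a boto3 SNS client (`boto3.client("sns")`) and that `topic_arn` points at the second, notification-only topic; the function name and the subject wording are hypothetical.

```python
def send_notification(sns, topic_arn, alarm_name, details):
    """Publish a notification to the separate, notification-only topic."""
    # SNS subjects must be a single line and under 100 characters.
    subject = f"Error alarm: {alarm_name}".replace("\n", " ")[:99]
    return sns.publish(TopicArn=topic_arn, Subject=subject, Message=details)
```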

I'm not sure if this would be easier than log4net, but if you're intent on doing this sort of post-processing on your logs, it may be better to send unhandled exceptions to SNS directly, post-process them in Lambda first, and then write out to CloudWatch Logs from your Lambda function. This change would allow you to inspect the unhandled exception via the SNS message payload, and give you some additional control over how to suppress overlapping concerns.
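A minimal sketch of that alternative flow, under several assumptions: the application publishes each unhandled exception to SNS as a JSON payload with an `exceptionType` field (a hypothetical shape, not something SNS or log4net defines), `logs` is a boto3 CloudWatch Logs client, and the log group and stream already exist. The function name and the `thresholded_types` parameter are illustrative.

```python
import json
import time


def process_exceptions(logs, event, group, stream, thresholded_types, now_ms=None):
    """Write every exception to CloudWatch Logs; return those to notify on now."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    log_events, notify_now = [], []
    for record in event.get("Records", []):
        raw = record["Sns"]["Message"]
        payload = json.loads(raw)  # assumed shape: {"exceptionType": ..., ...}
        log_events.append({"timestamp": now_ms, "message": raw})
        # Types covered by an "X occurrences in N minutes" alarm are only
        # logged; everything else is reported immediately.
        if payload.get("exceptionType") not in thresholded_types:
            notify_now.append(payload)
    if log_events:  # PutLogEvents rejects an empty batch
        logs.put_log_events(
            logGroupName=group, logStreamName=stream, logEvents=log_events
        )
    return notify_now
```

Because every exception still lands in CloudWatch Logs, the threshold-based metric filter and alarm for the listed types keep working unchanged.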