I'm using an AWS Lambda (hourly triggered by a Cloudwatch rule) to trigger the creation of an EMR cluster to execute a job. The EMR cluster once finished its steps write a result file in a S3 bucket. The key path is the hour of the day
/bucket/2017/04/28/00/result.txt
/bucket/2017/04/28/01/result.txt
..
/bucket/2017/04/28/23/result.txt
I wanted to put some alert in case for some reason the EMR job failed to create the result.txt for the hour.
I have already put some alerts on the Lambda invocation count and on the lambda error count but I didn't manage to find the appropriate alert to test that the EMR actually correctly finishes its job.
Note that the Lambda is triggered every 3 min past the hour and takes about 15 minutes to complete. Would a good solution be to create an other Lambda that is triggered every 30min past the hour and checks that the correct key is present in the bucket? if not then write some logs to cloudwatch that I could monitor and use them to create my alert?
What other way could I achieve this alerting?
S3 offers free metrics on object count per bucket, but doesn't publish often enough for your use case.
For a cost, you can enable CloudWatch metrics for S3 requests to enable request metrics that write data in 1-minute periods. You could, for example, create a relevant alarm on the following S3 CloudWatch metrics:
PutRequests sum <= 0
over each hour4xxErrors sum >= 1
over 1 minute5xxErrors sum >= 1
over 1 minuteThe HTTP status code alarms on much shorter intervals (down to 1 minute), will offer feedback nearer to when these failures occur.
If you don't want to incur the cost of S3 request metrics, you could instead configure an event to publish a message to an SNS topic on S3 put. You can use CloudWatch to set up alerting on the sum of messages published (or lack thereof).
You could then create a CloudWatch alarm based on this topic failing to publish a message.
Dimensions: TopicName = YOURSNSTOPIC
Namespace: AWS/SNS
Metric Name: NumberOfMessagesPublished
Threshold: NumberOfMessagesPublished <= 0 for 60 minutes (4 periods)
Statistic: Sum
Period: 15 minutes
Treat missing data as: breaching
Actions: Send notification to another, separate SNS topic that sends you an email/sms, or otherwise publishes to some alerting service.
Note that both CloudWatch solutions have the caveat that they won't fire alerts exactly at 30 minutes past the hour, but they will capture your entire monitoring period.
You may be able to further configure from these base examples by adjusting your period or how cloudwatch treats missing data to get better results.
A lambda that triggers 30 minutes past the hour (via cron-style scheduling) to check S3 request metrics or the SNS topic's "NumberOfMessagesPublished" metric instead of relying on CloudWatch alarms could also accomplish this. This may be a better alternative if firing exactly 30 minutes past the hour is important, as the CloudWatch alarm's firing time will not be as precise.