aws-lambda amazon-sqs aws-step-functions

AWS SQS triggering Lambda - How to stop Lambda from taking in more SQS events until a certain task is complete


I've got an SQS queue that triggers a handler Lambda. This Lambda simply takes messages from the queue and starts an execution of a Step Functions state machine, with the message as the input.

The Lambda ends when it receives an HTTP response from Step Functions that the state machine began executing.
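
For reference, a handler of the kind described above might look roughly like the following sketch. This is not the actual code from the project: the environment variable name `STATE_MACHINE_ARN` and the helper `start_executions` are assumptions made for illustration, and the Step Functions client is passed in so the logic can be exercised without AWS access.

```python
import os


def start_executions(event, sfn_client, state_machine_arn):
    """Start one state machine execution per SQS record and return the execution ARNs."""
    execution_arns = []
    for record in event.get("Records", []):
        # The SQS message body is passed through unchanged as the execution
        # input, so it is assumed to be a JSON string.
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=record["body"],
        )
        execution_arns.append(response["executionArn"])
    return execution_arns


def handler(event, context):
    # boto3 is imported lazily; it ships with the Lambda Python runtime.
    import boto3

    # STATE_MACHINE_ARN is a hypothetical environment variable name.
    return start_executions(
        event,
        boto3.client("stepfunctions"),
        os.environ["STATE_MACHINE_ARN"],
    )
```

Because `start_execution` returns as soon as the execution has been accepted, the handler finishes long before the Glue job does.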

The state machine has as one of its tasks a Glue job with a concurrency limit of 1. So the flow goes:

SQS -> Lambda -> State machine (containing a Glue job)

The state machine steps:

  1. Pass some arguments around in the input message
  2. Run a Glue job task with the arguments
  3. Etc. etc.

When an SQS message triggers the Lambda and the invocation succeeds, the message is automatically deleted from the queue.

Desired outcome

The Glue job task in the state machine can only run one at a time, so I want the whole state machine to run only one execution at a time. I probably need new incoming events in the SQS queue to stay there until the current state machine run finishes.

The problem

Currently, if the state machine is already running, the Lambda will begin a second execution of the state machine.

But since the Glue job task from the first execution is still running when the second execution of the state machine attempts to run the job as well, Glue rejects the second run. The following error is returned during the second execution of the state machine:

{
  "resourceType": "glue",
  "resource": "startJobRun.sync",
  "error": "Glue.ConcurrentRunsExceededException",
  "cause": "Concurrent runs exceeded for GLUE_JOB_NAME (Service: AWSGlue; Status Code: 400; Error Code: ConcurrentRunsExceededException; Request ID: 60ea8feb-34a2-46e2-ac17-0152f22371a2; Proxy: null)"
}

This makes the state machine fail, and the SQS event that triggered the Lambda to start the state machine is lost forever; the state machine will not attempt to act on the event again.

Solutions I've considered

1)

Instead of making the SQS queue trigger the Lambda as events come in, I could make the Lambda time-scheduled and have it check the state machine for a current execution. If there isn't one, it'll fetch a message from the queue and start the state machine.

This is probably the simplest solution, but the downside is that it'll leave events in the queue for minutes at a time. More importantly, there's already a separate polling Lambda upstream that puts events in the SQS queue, so adding another time-scheduled Lambda feels redundant.

2)

The concurrency of the Glue job is not something I want to change.

However, if I make the Lambda poll Step Functions to see if there's an instance of the state machine running already, then I can make the Lambda retry later.

If I then give the Lambda a concurrency of 1, then while the Lambda function is waiting, the SQS queue will not trigger more instances of the function. New events in the queue will be blocked until the current state machine execution finishes.

The problem is that the Lambda would be running the entire time the state machine is executing, which might take a long while. That means unnecessarily long Lambda run time and billing time. It might also exceed Lambda's maximum timeout of 15 minutes.

3)

The Lambda can poll Step Functions for a current execution, and if there is one, it can raise a runtime error, which I believe will put the SQS event back onto the queue to retry later.

But as far as I know, SQS will trigger the Lambda immediately afterwards, even if there is a delay window. Besides, I don't want a delay window in cases where there won't be a current execution.
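
A minimal sketch of option 3, under stated assumptions: the exception class `StateMachineBusy` and the helper names are invented for illustration, the event source mapping is assumed to use a batch size of 1, and the client is injected so the logic can be tested without AWS. As far as I understand, when an invocation fails the message is not deleted and becomes visible again once the queue's visibility timeout elapses, so that timeout effectively acts as the retry interval.

```python
class StateMachineBusy(Exception):
    """Raised so the invocation fails and SQS keeps the message for a later retry."""


def is_running(sfn_client, state_machine_arn):
    """Return True if at least one execution is reported as RUNNING."""
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="RUNNING",
        maxResults=1,
    )
    return len(response["executions"]) > 0


def start_or_fail(event, sfn_client, state_machine_arn):
    """Start an execution for the single SQS record, or raise if one is running."""
    if is_running(sfn_client, state_machine_arn):
        raise StateMachineBusy("an execution is already running; retry later")
    # Assumes batch size 1, so there is exactly one record in the event.
    record = event["Records"][0]
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=record["body"],
    )
    return response["executionArn"]
```

The check and the start are not atomic, so two concurrent invocations could both see no running execution; limiting the Lambda's concurrency to 1 would narrow that window. Each failed attempt also counts toward the queue's `maxReceiveCount` if a dead-letter queue is configured.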


I want to ask for a better solution than these 3, but if there's not, I'll resort to 1).


Solution

  • The following setup can be used:

    Use a Lambda function that polls the SQS queue for jobs itself (instead of being triggered by the queue) and then starts the state machine. Give the Lambda two triggers to make it fail-safe:

    1. The primary trigger is the event emitted when a state machine execution completes. This solves the serialization problem, because the next job starts only after the previous one has finished. https://docs.aws.amazon.com/step-functions/latest/dg/cw-events.html
    2. A time-scheduled event. This ensures the Lambda does not miss new jobs that arrive after the queue has been emptied, when no execution is running to emit a completion event. At the start of the Lambda, check whether an execution of the state machine is already running; if so, exit.
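
The two-trigger setup above could be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the function name `start_next_job` and the environment variable names `QUEUE_URL` and `STATE_MACHINE_ARN` are hypothetical, the queue is assumed to no longer be configured as an event source for the Lambda, and both clients are injected so the logic can be tested without AWS.

```python
import os

# EventBridge pattern for the primary trigger: executions that reached a
# terminal state. A "stateMachineArn" filter can be added under "detail".
COMPLETION_EVENT_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"]},
}


def start_next_job(sqs_client, sfn_client, queue_url, state_machine_arn):
    """Start the state machine for the next queued message, unless one is running.

    Returns the new execution ARN, or None if nothing was started.
    """
    running = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="RUNNING",
        maxResults=1,
    )
    if running["executions"]:
        # An execution is in progress; its completion event invokes us again.
        return None

    received = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=1,
    )
    messages = received.get("Messages", [])  # key is absent when the queue is empty
    if not messages:
        return None

    message = messages[0]
    execution = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=message["Body"],
    )
    # Delete only after the execution has been started, so a failed start
    # leaves the message on the queue.
    sqs_client.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
    return execution["executionArn"]


def handler(event, context):
    # Invoked by both the completion event and the schedule; the event payload
    # itself is not needed. boto3 ships with the Lambda Python runtime.
    import boto3

    return start_next_job(
        boto3.client("sqs"),
        boto3.client("stepfunctions"),
        os.environ["QUEUE_URL"],  # hypothetical variable name
        os.environ["STATE_MACHINE_ARN"],  # hypothetical variable name
    )
```

`list_executions` is documented as eventually consistent, so the check right after a completion event may occasionally still report a running execution; in that case the scheduled trigger picks up the next message on its following run.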