Search code examples
amazon-web-servicesaws-glueaws-step-functions

AWS Step Functions - Wait Until All Jobs under a trigger are finished before proceeding


I have a step function state machine I'm building that is based on 50+ Glue jobs, some of which have dependencies that I'd like capture in the workflow. I see two options of accomplishing this:

  1. Using the StartJob and parallel flow. Have a set of n parallel flows with all jobs without dependencies in the first flow. All jobs that are dependent on jobs in the first flow in the second flow. All jobs that are dependent on jobs in the second flow in the third flow etc... I can then check the "Wait for callback" option within each job that way we only move onto the next parallel flow when every job has sent a callback.

  2. I already have glue triggers that have jobs grouped into logical sets based on the dependencies of those jobs. I wanted to figure out a way to use the StartTrigger module. The issue here is that I want to wait to proceed to the next flow only when all jobs within a trigger are complete. If I specify "Wait for callback" using "StartTrigger" it proceeds to the next step almost immediately as once the trigger has started the task is complete. Is there a way around this?

#2 is obviously the preferable option as I do not want to have drag 60+ StartJob nodes into my step function so I was really wondering if there's a way to accomplish option 2.

The following is my attempt at option #2, but I'm getting an error that the gluejob is missing metadata

{
  "Comment": "A description of my state machine",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "GetInitialTrigger",
          "States": {
            "GetInitialTrigger": {
              "Type": "Task",
              "Parameters": {
                "Name": "stepfunctionstest"
              },
              "Resource": "arn:aws:states:::aws-sdk:glue:getTrigger",
              "Next": "Map",
              "OutputPath": "$.Trigger"
            },
            "Map": {
              "Type": "Map",
              "ItemProcessor": {
                "ProcessorConfig": {
                  "Mode": "INLINE"
                },
                "StartAt": "Pass",
                "States": {
                  "Pass": {
                    "Type": "Pass",
                    "Next": "Glue StartJobRun"
                  },
                  "Glue StartJobRun": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::glue:startJobRun",
                    "Parameters": {
                      "JobName": "$.job_name",
                      "Arguments.$":"$.Arguments"
                    },
                    "End": true
                  }
                }
              },
              "InputPath": "$.Actions",
              "MaxConcurrency": 1,
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}
 

Solution

  • I see a couple of ways you could solve this. If you were to go along the lines of your first approach, you could use nested map states as in this example. Here I am just using a Pass and Wait state as placeholders for where you'd use a arn:aws:states:::glue:startJobRun.sync task. I generate an array of "job lists" with each job list being an array of jobs. Then send that into Map with concurrency of 1 so it will iterate through the job lists sequentially. For each job list, it will run the jobs in parallel using another map state without MaxConcurrency set. This example uses Inline map because I don't expect you to have more than 40 items in each list. But you could easily switch to Distributed Map if you had more.

    {
      "Comment": "A description of my state machine",
      "StartAt": "Generate Job Lists",
      "States": {
        "Generate Job Lists": {
          "Type": "Pass",
          "Parameters": [
            [
              "job1",
              "job2",
              "job3",
              "job4"
            ],
            [
              "job5"
            ],
            [
              "job6",
              "job7",
              "job8"
            ]
          ],
          "Next": "Run Job List"
        },
        "Run Job List": {
          "Type": "Map",
          "ItemProcessor": {
            "ProcessorConfig": {
              "Mode": "INLINE"
            },
            "StartAt": "Run Job",
            "States": {
              "Run Job": {
                "Type": "Map",
                "ItemProcessor": {
                  "ProcessorConfig": {
                    "Mode": "INLINE"
                  },
                  "StartAt": "Run a Job",
                  "States": {
                    "Run a Job": {
                      "Type": "Pass",
                      "Next": "Wait"
                    },
                    "Wait": {
                      "Type": "Wait",
                      "Seconds": 1,
                      "End": true
                    }
                  }
                },
                "End": true
              }
            }
          },
          "End": true,
          "MaxConcurrency": 1
        }
      }
    }
    

    Run this and you can see that it does these in batches:

    enter image description here

    I'm not super familiar with Glue Triggers, but I suspect you could use your second approach using a polling loop to start and then monitor for completion, since Step Functions doesn't support .sync for StartTrigger. This blog post (Orchestrating AWS Glue crawlers using AWS Step Functions) demonstrates how to do this for Glue Crawlers, including encapsulation of the polling wrapper in a separate state machine for easier reuse.

    And you could combine these approaches as well I'd imagine.