I have a use case in which I need to create an AWS Glue Crawler to crawl some data stored in S3, start the crawler, then delete the crawler after it has finished crawling the data.
The dilemma I've run into is that the crawler can take a significant amount of time to complete, sometimes 20-30 minutes, before it can be deleted.
Initially I had intended to solve this with the AWSGlueAsyncClient: rather than blocking the calling thread for 20-30 minutes, I could write a callback so that the crawler would be deleted as soon as it finished.
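Roughly what I had in mind, as a minimal sketch (the crawler name and poll interval are placeholders; since the SDK's async callback fires when the StartCrawler API call returns rather than when crawling finishes, completion still has to be polled):

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.DeleteCrawlerRequest;
import com.amazonaws.services.glue.model.GetCrawlerRequest;
import com.amazonaws.services.glue.model.StartCrawlerRequest;

public class CrawlerCleanup {
    public static void main(String[] args) throws InterruptedException {
        AWSGlue glue = AWSGlueClientBuilder.defaultClient();
        String crawlerName = "my-crawler"; // placeholder name

        glue.startCrawler(new StartCrawlerRequest().withName(crawlerName));

        // Poll until the crawler returns to READY, then delete it.
        // If this process dies during the 20-30 minute run, the delete never happens.
        String state;
        do {
            Thread.sleep(30_000);
            state = glue.getCrawler(new GetCrawlerRequest().withName(crawlerName))
                        .getCrawler().getState();
        } while (!"READY".equals(state));

        glue.deleteCrawler(new DeleteCrawlerRequest().withName(crawlerName));
    }
}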
The issue with this is that if the server were to go down or be interrupted during the 20-30 minute window the crawler takes to complete, the crawler would never be deleted.
What would be a good way to persist the crawler deletion step so that, even if the server went down, the deletion would still be attempted once the server came back up? A database seems like overkill.
You can set up an EventBridge rule that triggers a Lambda function when the crawler completes; the function then deletes the crawler. Because the rule and the function run outside your server, the cleanup happens even if your server goes down mid-crawl. An example event pattern:
{
  "source": [
    "aws.glue"
  ],
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "state": [
      "Succeeded"
    ]
  }
}
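A minimal sketch of the function, assuming the AWS SDK for Java v1 (to match the AWSGlueAsyncClient you mentioned) and that the event's detail includes the crawler name under "crawlerName"; the handler class name is hypothetical:

import java.util.Map;

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.DeleteCrawlerRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class DeleteCrawlerHandler implements RequestHandler<Map<String, Object>, String> {

    private final AWSGlue glue = AWSGlueClientBuilder.defaultClient();

    @Override
    @SuppressWarnings("unchecked")
    public String handleRequest(Map<String, Object> event, Context context) {
        // The EventBridge event's "detail" object carries the crawler name.
        Map<String, Object> detail = (Map<String, Object>) event.get("detail");
        String crawlerName = (String) detail.get("crawlerName");

        glue.deleteCrawler(new DeleteCrawlerRequest().withName(crawlerName));
        context.getLogger().log("Deleted crawler: " + crawlerName);
        return crawlerName;
    }
}

Note that the function's execution role also needs the glue:DeleteCrawler permission for the delete call to succeed.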