Tags: amazon-web-services, aws-glue

How to trigger a Glue crawler?


I am setting up a Glue crawler to read from an S3 bucket and populate a Glue Catalog database. Once the resources are created, how can I trigger the crawler? Can I hook it to S3 object creation? Also, can the crawler detect relationships between the data and create tables accordingly, similar to tables in a relational database with foreign keys linking them? For example, one file has product names and other files have product details.

[
  { "id": 1, "product": "Tablet" },
  { "id": 2, "product": "headphones" }
]

File 2:

[{ "name": "Tablet", "price": 350 ... }, {}]

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }
}

Solution

  • You can't trigger a Glue crawler directly from an S3 object-creation event. There are a few ways to achieve this, summarized below:

    1- Using Lambda (or Step Functions): Create a Lambda function that gets triggered by S3 (you can specify your preferred path), and inside the function start the crawler with code like the following (the same approach works from a Step Functions state):

    import boto3

    # region_name and endpoint_url are usually optional; set them if needed
    glue = boto3.client(service_name='glue', region_name='your-region-name')

    def lambda_handler(event, context):
        # The Lambda execution role needs the glue:StartCrawler permission
        try:
            glue.start_crawler(Name='your-crawler-name')
        except Exception as error:
            print('Failed to start crawler')
            raise error
    
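    To wire the bucket to that function, you can attach an S3 event notification with boto3. A minimal sketch — the bucket name, key prefix, and function ARN below are placeholders, and the Lambda must already allow s3.amazonaws.com to invoke it:

    ```python
    def build_notification_config(function_arn, prefix):
        # Invoke the Lambda on every object-created event under the given prefix
        return {
            "LambdaFunctionConfigurations": [
                {
                    "LambdaFunctionArn": function_arn,
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {
                            "FilterRules": [{"Name": "prefix", "Value": prefix}]
                        }
                    },
                }
            ]
        }

    def attach_notification(bucket, function_arn, prefix=""):
        # boto3 imported here so the payload builder works without AWS access
        import boto3
        s3 = boto3.client("s3")
        s3.put_bucket_notification_configuration(
            Bucket=bucket,
            NotificationConfiguration=build_notification_config(function_arn, prefix),
        )
    ```

    Note that put_bucket_notification_configuration replaces the bucket's entire notification configuration, so merge with any existing rules first.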

    2- Using EventBridge (with CloudTrail data events for S3): Create a rule that listens for S3 object-upload events, and set its target to a Glue workflow that contains your crawler (a crawler can't be an EventBridge target directly). Generally your rule's event pattern will look like this:

      {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
          "eventSource": ["s3.amazonaws.com"],
          "eventName": ["PutObject"],
          "requestParameters": {
            "bucketName": ["<your-bucket-name>"]
          }
        }
      }
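    The rule above can also be created programmatically. A hedged boto3 sketch — the rule name, workflow ARN format, and role ARN are assumptions; the role must allow EventBridge to start the workflow:

    ```python
    import json

    # Same CloudTrail-based pattern as shown above
    EVENT_PATTERN = {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": ["<your-bucket-name>"]},
        },
    }

    def create_rule_and_target(rule_name, workflow_arn, role_arn):
        # Create the EventBridge rule and point it at the Glue workflow
        import boto3
        events = boto3.client("events")
        events.put_rule(Name=rule_name, EventPattern=json.dumps(EVENT_PATTERN))
        events.put_targets(
            Rule=rule_name,
            Targets=[{"Id": "glue-workflow", "Arn": workflow_arn, "RoleArn": role_arn}],
        )
    ```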
    

    Lastly, you need to set up a trigger:

    {
      "Name": "<your-trigger-name>",
      "WorkflowName": "<your-workflow-name>",
      "Type": "EVENT",
      "EventBatchingCondition": {
        "BatchSize": 1
      },
      "Actions": [
        {
          "CrawlerName": "<your-crawler-name>"
        }
      ]
    }
    

    You can change the configuration as you need. For example, in the configuration above, "BatchSize": 1 means the trigger fires after one file is uploaded to S3; increase it to batch multiple uploads into a single crawl.
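    The trigger JSON above maps directly onto the glue.create_trigger API. A minimal sketch — the trigger, workflow, and crawler names are placeholders:

    ```python
    def build_trigger_request(trigger_name, workflow_name, crawler_name, batch_size=1):
        # Request payload for glue.create_trigger, mirroring the JSON above
        return {
            "Name": trigger_name,
            "WorkflowName": workflow_name,
            "Type": "EVENT",
            "EventBatchingCondition": {"BatchSize": batch_size},
            "Actions": [{"CrawlerName": crawler_name}],
        }

    def create_event_trigger(trigger_name, workflow_name, crawler_name, batch_size=1):
        import boto3
        glue = boto3.client("glue")
        glue.create_trigger(
            **build_trigger_request(trigger_name, workflow_name, crawler_name, batch_size)
        )
    ```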