I'm setting up a Glue crawler to read from an S3 bucket and create a Glue Catalog database. Once the resources are created, how can I trigger the crawler? Can I hook it to S3 object creation? Also, can the crawler detect relationships between the data and create the tables accordingly, similar to tables in a relational database with foreign keys linking them? For example, one file has product names and other files have the product details:
[{ "id" : 1,
"product" : "Tablet",
},
{ "id" : 2,
"product" : "headphones"
}]
file 2. [{"name": "Tablet", "price" : 350 ...}, {}]
resource "aws_glue_crawler" "example" {
database_name = aws_glue_catalog_database.example.name
name = "example"
role = aws_iam_role.example.arn
s3_target {
path = "s3://${aws_s3_bucket.example.bucket}"
}
}
You can't directly trigger a Glue crawler upon S3 object creation, but there are a couple of ways to achieve the same effect, summarized below. (As for your second question: a crawler only infers the schema of each dataset and registers tables in the Data Catalog; it does not detect relationships between files or create foreign keys, so any joins between your product files are handled in your queries or ETL jobs.)
1- Using Lambda (or Step Functions): Create a Lambda function that is triggered by S3 object-created notifications (you can scope the notification to your preferred path), and inside the Lambda function add code like the following to start your crawler (you can follow the same approach with Step Functions):
import boto3

glue = boto3.client('glue', region_name='your-region-name')

def lambda_handler(event, context):
    # Start the crawler; log the failure before re-raising so the invocation shows as errored
    try:
        glue.start_crawler(Name='your-crawler-name')
    except Exception as error:
        print(error)
        print('Failed to start crawler')
        raise error
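If you also want to wire up the S3 notification itself programmatically, a minimal boto3 sketch could look like the following (the bucket name, function name, ARNs, and prefix are placeholders I'm assuming here; you could equally define the notification in Terraform with aws_s3_bucket_notification):

import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

# Allow S3 to invoke the function (statement id and ARNs are placeholders)
lambda_client.add_permission(
    FunctionName='start-crawler-fn',
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::your-bucket-name',
)

# Send ObjectCreated events under the chosen prefix (placeholder) to the Lambda function
s3.put_bucket_notification_configuration(
    Bucket='your-bucket-name',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:your-region:123456789012:function:start-crawler-fn',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'data/'}]}},
        }]
    },
)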
2- Using EventBridge (formerly CloudWatch Events): Create a rule that listens for S3 object upload events. The pattern below matches CloudTrail API-call events, so it requires a CloudTrail trail that logs S3 data events for your bucket (alternatively, enable EventBridge notifications on the bucket and match the "Object Created" detail type instead). EventBridge can't start a crawler directly, so set the rule target to a Glue workflow that contains your crawler. Generally your rule configuration will look like this:
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["<your-bucket-name>"]
    }
  }
}
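If you'd rather create the rule and its target from code, a rough boto3 sketch might look like this (the rule name, workflow name, account ID, and role ARN are placeholder assumptions; the role must be one that EventBridge can assume and that is allowed to notify the Glue workflow, so check the Glue event-driven workflow docs for the exact permissions):

import json
import boto3

events = boto3.client('events')

# Same event pattern as above (bucket name is a placeholder)
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject"],
        "requestParameters": {"bucketName": ["your-bucket-name"]},
    },
}

events.put_rule(Name='start-glue-workflow', EventPattern=json.dumps(pattern))

# The target is the Glue workflow, not the crawler; ARNs are placeholders
events.put_targets(
    Rule='start-glue-workflow',
    Targets=[{
        'Id': 'glue-workflow',
        'Arn': 'arn:aws:glue:your-region:123456789012:workflow/your-workflow-name',
        'RoleArn': 'arn:aws:iam::123456789012:role/eventbridge-glue-notify-role',
    }],
)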
Lastly, you need to set up an event trigger inside that workflow:
{
  "Name": "<your-trigger-name>",
  "WorkflowName": "<your-workflow-name>",
  "Type": "EVENT",
  "EventBatchingCondition": {
    "BatchSize": 1
  },
  "Actions": [
    {
      "CrawlerName": "<your-crawler-name>"
    }
  ]
}
You can change this configuration as you desire; for example, "BatchSize": 1 above means the trigger fires after one file event arrives from S3, and you can increase this based on your needs.
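If you prefer to create the workflow and its trigger from code instead of JSON, an equivalent boto3 sketch (all names are placeholders) could be:

import boto3

glue = boto3.client('glue')

# Workflow that the EventBridge rule targets (name is a placeholder)
glue.create_workflow(Name='your-workflow-name')

# Event trigger inside the workflow that starts the crawler when the rule fires
glue.create_trigger(
    Name='your-trigger-name',
    WorkflowName='your-workflow-name',
    Type='EVENT',
    EventBatchingCondition={'BatchSize': 1},
    Actions=[{'CrawlerName': 'your-crawler-name'}],
)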