I'm setting up a Glue crawler to read from an S3 bucket and create a Glue Catalog database. Once the resources are created, how can I trigger the crawler? Can I hook it to S3 object creation? Also, can the crawler detect relationships between the data and create the tables accordingly, similar to tables in a relational database with foreign keys linking them? For example, one file has product names and other files have the product details:
[{ "id" : 1,
"product" : "Tablet",
},
{ "id" : 2,
"product" : "headphones"
}]
file 2. [{"name": "Tablet", "price" : 350 ...}, {}]
resource "aws_glue_crawler" "example" {
database_name = aws_glue_catalog_database.example.name
name = "example"
role = aws_iam_role.example.arn
s3_target {
path = "s3://${aws_s3_bucket.example.bucket}"
}
}
You can't directly trigger a Glue crawler upon S3 object creation, but there are a couple of ways to achieve the same effect, summarized below. (As for your second question: a crawler only infers the schema of each dataset and registers tables in the Data Catalog; it does not detect relationships between files or create foreign keys, so any joins between your product files are handled in your queries or ETL jobs.)
1- Using Lambda (or Step Functions): Create a Lambda function that is triggered by S3 object-created notifications (you can scope the notification to your preferred path), and inside the Lambda function add code like the following to start your crawler (you can follow the same approach with Step Functions):
import boto3

glue = boto3.client('glue', region_name='your-region-name')

def lambda_handler(event, context):
    # Start the crawler; log the failure before re-raising so the invocation shows as errored
    try:
        glue.start_crawler(Name='your-crawler-name')
    except Exception as error:
        print(error)
        print('Failed to start crawler')
        raise error
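If you also want to wire up the S3 notification itself programmatically, a minimal boto3 sketch could look like the following (the bucket name, function name, ARNs, and prefix are placeholders I'm assuming here; you could equally define the notification in Terraform with aws_s3_bucket_notification):

import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

# Allow S3 to invoke the function (statement id and ARNs are placeholders)
lambda_client.add_permission(
    FunctionName='start-crawler-fn',
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::your-bucket-name',
)

# Send ObjectCreated events under the chosen prefix (placeholder) to the Lambda function
s3.put_bucket_notification_configuration(
    Bucket='your-bucket-name',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:your-region:123456789012:function:start-crawler-fn',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'data/'}]}},
        }]
    },
)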
2- Using EventBridge (formerly CloudWatch Events): Create a rule that listens for S3 object upload events. The pattern below matches CloudTrail API-call events, so it requires a CloudTrail trail that logs S3 data events for your bucket (alternatively, enable EventBridge notifications on the bucket and match the "Object Created" detail type instead). EventBridge can't start a crawler directly, so set the rule target to a Glue workflow that contains your crawler. Generally your rule configuration will look like this:
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["<your-bucket-name>"]
    }
  }
}
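If you'd rather create the rule and its target from code, a rough boto3 sketch might look like this (the rule name, workflow name, account ID, and role ARN are placeholder assumptions; the role must be one that EventBridge can assume and that is allowed to notify the Glue workflow, so check the Glue event-driven workflow docs for the exact permissions):

import json
import boto3

events = boto3.client('events')

# Same event pattern as above (bucket name is a placeholder)
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject"],
        "requestParameters": {"bucketName": ["your-bucket-name"]},
    },
}

events.put_rule(Name='start-glue-workflow', EventPattern=json.dumps(pattern))

# The target is the Glue workflow, not the crawler; ARNs are placeholders
events.put_targets(
    Rule='start-glue-workflow',
    Targets=[{
        'Id': 'glue-workflow',
        'Arn': 'arn:aws:glue:your-region:123456789012:workflow/your-workflow-name',
        'RoleArn': 'arn:aws:iam::123456789012:role/eventbridge-glue-notify-role',
    }],
)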
Lastly, you need to set up an event trigger inside that workflow:
{
  "Name": "<your-trigger-name>",
  "WorkflowName": "<your-workflow-name>",
  "Type": "EVENT",
  "EventBatchingCondition": {
    "BatchSize": 1
  },
  "Actions": [
    {
      "CrawlerName": "<your-crawler-name>"
    }
  ]
}
You can change this configuration as you desire; for example, "BatchSize": 1 above means the trigger fires after one file event arrives from S3, and you can increase this based on your needs.
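If you prefer to create the workflow and its trigger from code instead of JSON, an equivalent boto3 sketch (all names are placeholders) could be:

import boto3

glue = boto3.client('glue')

# Workflow that the EventBridge rule targets (name is a placeholder)
glue.create_workflow(Name='your-workflow-name')

# Event trigger inside the workflow that starts the crawler when the rule fires
glue.create_trigger(
    Name='your-trigger-name',
    WorkflowName='your-workflow-name',
    Type='EVENT',
    EventBatchingCondition={'BatchSize': 1},
    Actions=[{'CrawlerName': 'your-crawler-name'}],
)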