I'm having an issue with the events passed to my AWS Lambda function. I need to sync 60k records from one JSON file to a remote server. Since the function cannot process all 60k records within the 15-minute limit, my solution is to split the file into many small files (2,000 records per file). After splitting, I upload the files to S3 one by one at 10-second intervals; each new upload sends an S3 event notification that triggers the function, and that execution processes the newly uploaded file. The speed is impressive, but the sync does not stop at 60k records; it keeps going and adds roughly 4k more items (I don't know where they come from). When I check the latest CloudWatch logs, it looks like some executions receive more than one S3 event, like below
"Records": [
{
"eventVersion": "2.1",
"eventSource": "aws:s3",
"awsRegion": "ap-southeast-1",
"eventTime": "2023-09-08T06:26:38.873Z",
"eventName": "ObjectCreated:Put",
"userIdentity": {
"principalId": "AWS:AIDA4EODLYGLIMIHEIAVB"
},
"object": {
"key": "test_4.json",
"size": 846900,
"eTag": "44768b0689b2acd120c7683d8f0ce236",
"sequencer": "0064FABE9E323A01CD"
}
}
}
]
"Records": [
{
"eventVersion": "2.1",
"eventSource": "aws:s3",
"awsRegion": "ap-southeast-1",
"eventTime": "2023-09-08T06:28:06.642Z",
"eventName": "ObjectCreated:Put",
"userIdentity": {
"principalId": "AWS:AIDA4EODLYGLIMIHEIAVB"
},
"object": {
"key": "test_12.json",
"size": 845141,
"eTag": "05008d0d12413f98c3aefa101bbbd5eb",
"sequencer": "0064FABEF5B003C033"
}
}
}
]
This is so weird. This is the first time I have worked with AWS, so can someone help me explain the reason? Also, is it possible to make all executions run synchronously instead of asynchronously? I think if they ran in order, it would be easier to debug.
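For reference, here is a simplified sketch of my split-and-upload step (the bucket name and file paths are placeholders, not my exact code; it assumes boto3):

# Sketch only: split one large JSON array into 2,000-record chunk files and
# upload them to S3 at 10-second intervals. Names are illustrative.
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-sync-bucket"   # placeholder bucket name
CHUNK_SIZE = 2000

with open("input.json") as f:
    records = json.load(f)  # a list of ~60k items

for i in range(0, len(records), CHUNK_SIZE):
    key = f"test_{i // CHUNK_SIZE}.json"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(records[i:i + CHUNK_SIZE]))
    time.sleep(10)          # 10-second interval between uploads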
Even though you upload the files to the S3 bucket at 10-second intervals, it appears that S3 sometimes combines several file events into a single event notification. This is the default behavior and cannot be controlled through configuration. Furthermore, S3 event notifications are delivered at least once, which can result in duplicate messages reaching your function.
You can follow the approach below to address this:
Instead of relying on S3 events to trigger the Lambda function, add logic to your file-divider solution that publishes the file path to an SQS FIFO (First-In-First-Out) queue immediately after each file is uploaded to the S3 bucket. Then configure SQS as the trigger for your Lambda function. When setting up the SQS trigger, set the batch size of the event source mapping to 1. This ensures that your Lambda function consistently receives one file event at a time. Because you are using a FIFO queue, it also guarantees message order and prevents duplicate messages.
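Here is a minimal sketch of that approach using boto3. The queue and bucket names are placeholders, and process() stands in for your existing sync logic:

# Sketch only: publish each uploaded object's key to an SQS FIFO queue,
# then consume one message per Lambda invocation (batch size 1).
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-sync-bucket"                                        # placeholder
QUEUE_URL = sqs.get_queue_url(QueueName="file-sync-queue.fifo")["QueueUrl"]


def upload_and_enqueue(local_path: str, key: str) -> None:
    """Upload one chunk file to S3, then publish its key to the FIFO queue."""
    s3.upload_file(local_path, BUCKET, key)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": BUCKET, "key": key}),
        MessageGroupId="file-sync",        # one group => strict ordering
        MessageDeduplicationId=key,        # deduplicate by object key
    )


def lambda_handler(event, context):
    """SQS-triggered Lambda. With batch size 1, there is exactly one record."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
        data = json.loads(obj["Body"].read())
        process(data)                      # your existing sync logic


def process(data):
    ...

The batch size of 1 is set on the event source mapping, for example with aws lambda create-event-source-mapping --batch-size 1, or in the Lambda console when adding the SQS trigger. Also note that because every message shares a single MessageGroupId, Lambda processes them one at a time in order, which effectively gives you the sequential behavior you asked about.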