sql amazon-web-services aws-lambda etl aws-glue

Prevent files from being processed multiple times in AWS Glue


We are using AWS Glue for our ETL processing. The data flows like this: landing -> raw -> stage -> curated -> Redshift.

However, when the data flows through each day, the existing data is getting exactly doubled.

For example:

  • Aug 1: I have 100 records
  • Aug 2: I have 20 records

In Redshift, I would like to see 120 records at the end of August 2. Instead, I am getting 220 records. Please point me to a way to avoid this.

I would also like to retain partitioning by run date in both raw and stage.


Solution

  • It seems that you want to track files that have already been processed so that they are not processed again. You can do that by using the job bookmark feature of Glue.
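
To make the idea concrete, here is a toy model in plain Python of what tracking processed files buys you, using the numbers from the question. It is only a sketch and not how Glue implements bookmarks internally; the file names, the `run_job` helper, and the list standing in for Redshift are all made up for illustration.

```python
# Toy model of the bookmark idea. This is NOT Glue's implementation.
# File names and the run_job helper are hypothetical; the record counts
# come from the example in the question.

def run_job(landing, target, bookmark=None):
    """Load files from `landing` into `target`.

    landing:  dict mapping file name -> number of records in that file
    target:   list collecting loaded record counts (stands in for Redshift)
    bookmark: set of already processed file names, or None to disable tracking
    """
    for name, count in sorted(landing.items()):
        if bookmark is not None and name in bookmark:
            continue  # already loaded by an earlier run, skip it
        target.append(count)
        if bookmark is not None:
            bookmark.add(name)  # remember the file for the next run


# Aug 1: one file with 100 records. Aug 2: a second file with 20 records
# arrives while the Aug 1 file is still sitting in the landing area.
day1 = {"aug-01/data.json": 100}
day2 = {**day1, "aug-02/data.json": 20}

# Without tracking, the Aug 2 run reloads the Aug 1 file.
no_bookmark_target = []
run_job(day1, no_bookmark_target)
run_job(day2, no_bookmark_target)
total_without = sum(no_bookmark_target)  # 100 + (100 + 20)

# With tracking, the Aug 2 run only picks up the new file.
bookmarked_target = []
seen = set()
run_job(day1, bookmarked_target, seen)
run_job(day2, bookmarked_target, seen)
total_with = sum(bookmarked_target)  # 100 + 20

print(total_without, total_with)  # prints "220 120"
```

In a real Glue job you do not keep this state yourself. Bookmarks are switched on with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`, each source read in the script needs a `transformation_ctx`, and the script has to call `job.init(...)` at the start and `job.commit()` at the end so the bookmark state gets saved.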