We are using AWS Glue for compute. The data flows like this: landing -> raw -> stage -> curated -> Redshift.
However, each day when the pipeline runs, the data ends up doubled.
For example:
In Redshift, I expect to see 120 records at the end of August 2. Instead, I am getting 220 records. Please point me to a way to avoid this.
I would also like to retain partitions based on the run date in both raw and stage.
It seems that you want to track files that have already been processed so that they are not loaded again on the next run. You can prevent the reprocessing by using the job bookmark feature of Glue.
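
To illustrate why the count lands on 220 rather than 120, here is a minimal pure-Python sketch of the idea behind bookmarks. It is not Glue code. It assumes, as a guess from the numbers in the question, that day 1 delivered 100 records, day 2 delivered 20 more, and each run appends everything it reads to the target. The file names, `run_job`, and the `bookmark` set are all hypothetical stand-ins for the state Glue keeps for you.

```python
def run_job(landing_files, target, bookmark=None):
    """Append records from landing files to target.

    If a bookmark set is given, files already recorded in it are skipped,
    and newly processed files are added to it.
    """
    for name, records in landing_files.items():
        if bookmark is not None and name in bookmark:
            continue  # already processed in an earlier run
        target.extend(records)
        if bookmark is not None:
            bookmark.add(name)
    return target


# Hypothetical landing contents: day 2 still contains the day 1 file.
day1 = {"aug-01.csv": list(range(100))}
day2 = {**day1, "aug-02.csv": list(range(100, 120))}

# Without a bookmark, the day 2 run re-reads the day 1 file.
without_bookmark = []
run_job(day1, without_bookmark)
run_job(day2, without_bookmark)
print(len(without_bookmark))  # 220

# With a bookmark, only the new file is loaded on day 2.
with_bookmark, seen = [], set()
run_job(day1, with_bookmark, seen)
run_job(day2, with_bookmark, seen)
print(len(with_bookmark))  # 120
```

In Glue itself this state is managed by the service: the job is run with `--job-bookmark-option` set to `job-bookmark-enable`, the script calls `job.init(...)` at the start and `job.commit()` at the end, and each source read is given a `transformation_ctx` so its progress can be tracked.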