In the medallion layer architecture (Bronze, silver & gold), when performing incremental ETL (e.g. extracting the last X days of transactions from a source) is it best to partition the bronze layer by extraction or transaction date? I understand that the bronze layer should be the raw data in the delta format, but is it best practice to merge into that layer from a landing zone source? Or partition by extract date and always append.
In the examples I've seen the source is always just producing the latest records, however in my case we're using a sliding window ETL so there are duplications between days - this is as we have records that come into the source 'late' therefore we need to ensure that they are accounted for. Therefore its not a simple case of just being able to append to bronze and there not being overlap.
I am thinking one of the following, which is the best practice:
Its worth noting that the data is large data with potential millions of records for each day.
We have the same architecture and we decided to go with your first solution. Which means, use a raw Layer in order to keep all the historical data as raw files (partitioned by the receiving date) and then merge the data into the Bronze zone.
Regards, Amine.