apache-spark databricks etl azure-data-lake delta-lake

Delta Lake medallion layer architecture, Bronze layer incremental ETL best practices


In the medallion layer architecture (Bronze, Silver and Gold), when performing incremental ETL (e.g. extracting the last X days of transactions from a source), is it best to partition the Bronze layer by extraction date or by transaction date? I understand that the Bronze layer should hold the raw data in Delta format, but is it best practice to merge into that layer from a landing zone source, or to partition by extract date and always append?

In the examples I've seen, the source always produces only the latest records. In my case, however, we're using a sliding-window ETL, so there are duplicates between days. This is because some records arrive in the source 'late', and we need to ensure that they are accounted for. It's therefore not a simple case of appending to Bronze with no overlap between extracts.
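To make the overlap concrete, here is a minimal pure-Python sketch of two consecutive sliding-window extracts. The column names (`txn_id`, `txn_date`, `extract_date`), the 3-day window and the sample rows are all hypothetical; the point is only to show how appending duplicates rows while a keyed merge keeps one row per transaction.

```python
from datetime import date, timedelta

def extract_window(source, run_date, window_days):
    """Return source rows whose transaction date falls inside the sliding window,
    tagged with the date of the extract run."""
    start = run_date - timedelta(days=window_days - 1)
    return [dict(row, extract_date=run_date)
            for row in source
            if start <= row["txn_date"] <= run_date]

# Hypothetical state of the source on day 1.
day1_source = [
    {"txn_id": 1, "txn_date": date(2024, 1, 1), "amount": 10},
    {"txn_id": 2, "txn_date": date(2024, 1, 2), "amount": 20},
]
# On day 2, txn 3 arrives late (its transaction date is already in the past)
# and txn 4 is a genuinely new transaction.
day2_source = day1_source + [
    {"txn_id": 3, "txn_date": date(2024, 1, 1), "amount": 30},
    {"txn_id": 4, "txn_date": date(2024, 1, 3), "amount": 40},
]

run1 = extract_window(day1_source, date(2024, 1, 2), window_days=3)
run2 = extract_window(day2_source, date(2024, 1, 3), window_days=3)

# Append-only: every extract is kept, so txn 1 and txn 2 are stored twice.
appended = run1 + run2

# Keyed merge: one row per txn_id, the row from the latest extract wins.
merged = {}
for row in sorted(appended, key=lambda r: r["extract_date"]):
    merged[row["txn_id"]] = row

print(len(appended), len(merged))
```

With these sample rows the append-only result holds six rows for four distinct transactions, which is the growth described for the append option.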

I am thinking of one of the following. Which is the best practice?

  1. Land all records from the source in a landing zone (non-Delta, Parquet format), partitioned by extract date. Then merge into the Bronze Delta tables, partitioned by transaction date.
  2. Append to the Bronze Delta tables using extract date as the partition column. The Bronze Delta layer will then get very large, because records repeat between each day's extract.
  3. Merge into the Bronze layer directly from the source, partitioned by transaction date (i.e. skip the landing zone). The only thing I'm worried about here is the possibility that the merge is set up incorrectly and we lose history (maybe a low risk?).

It's worth noting that the data is large, with potentially millions of records for each day.


Solution

  • We have the same architecture and we decided to go with your first solution. That means using a raw layer to keep all the historical data as raw files (partitioned by the receiving date) and then merging the data into the Bronze zone.

    Regards, Amine.
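A rough sketch of what the raw-to-Bronze merge described above might look like with PySpark and the `delta-spark` package. This is not the answerer's actual code. The paths, the key column `txn_id`, the partition column `txn_date` and the `extract_date` partition in the raw zone are all assumptions, as is the Bronze table already existing with the same columns as the raw files.

```python
from datetime import date

def build_merge_condition(key_cols, partition_col, window_start):
    """Build the MERGE condition: match on the business key, and restrict the
    target to partitions inside the extract window so that Delta can skip
    older transaction-date partitions."""
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return f"t.{partition_col} >= '{window_start.isoformat()}' AND {key_match}"

def merge_raw_into_bronze(spark, raw_path, bronze_path, extract_date, window_start,
                          key_cols=("txn_id",), partition_col="txn_date"):
    """Merge one day's raw extract into the Bronze Delta table (hypothetical helper)."""
    # Imported lazily so the condition helper can be used without Spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    # Read only the raw partition written by this extract run.
    latest = (spark.read.parquet(raw_path)
              .where(F.col("extract_date") == F.lit(extract_date.isoformat())))

    # MERGE fails if several source rows match the same target row, so keep one
    # row per key. dropDuplicates picks an arbitrary row; use a deterministic
    # rule instead if the source carries a version or update timestamp.
    deduped = latest.dropDuplicates(list(key_cols))

    (DeltaTable.forPath(spark, bronze_path).alias("t")
        .merge(deduped.alias("s"),
               build_merge_condition(key_cols, partition_col, window_start))
        .whenMatchedUpdateAll()      # a re-sent record overwrites the Bronze row
        .whenNotMatchedInsertAll()   # late or new records are inserted
        .execute())

print(build_merge_condition(("txn_id",), "txn_date", date(2024, 1, 1)))
```

The partition filter in the condition assumes that every row in the extract has a transaction date on or after `window_start`, and that a record's transaction date never changes; otherwise a row could be inserted a second time instead of updated. Because the raw files are kept per extract date, Bronze can be rebuilt from them if the merge turns out to be misconfigured.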