Let's assume that we have a data warehouse composed of four components:
So far so good; regarding regular delta loading I have no issue. However, the question I ask myself is this: I have regularly encountered situations where, for whatever reason, you want to resubmit already-extracted data into the loading pipeline. Let's assume, for instance, that we select all the flat files extracted over the last 15 days and push them again through the ETL process.
Alternative 1: just insert the new rows. We would then have:

- row 2: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
- previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
But then, when loading into the DWH, we need some kind of insert/update mechanism; we cannot do a straight insert, as it would create duplicate facts.
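To make the duplicate-fact problem concrete, here is a minimal sketch of Alternative 1, assuming the ODS rows are plain Python dicts (the names `ods_rows`, `natural_key`, `batch_date` and so on are illustrative, not part of any real schema): the re-submitted extraction is appended as-is, so the DWH load must keep only the latest batch per natural key.

```python
from datetime import datetime

# Hypothetical ODS content after the re-submission was appended
# (Alternative 1): the same natural key now appears twice.
ods_rows = [
    {"natural_key": "ID001", "batch_date": datetime(2022, 1, 10, 1, 0),
     "source_modified": datetime(2022, 1, 10, 10, 43)},
    {"natural_key": "ID001", "batch_date": datetime(2022, 1, 12, 16, 34),
     "source_modified": datetime(2022, 1, 10, 10, 43)},
]

# When promoting rows to the DWH, keep only the latest batch per
# natural key so the resubmitted file does not create duplicate facts.
latest = {}
for row in ods_rows:
    key = row["natural_key"]
    if key not in latest or row["batch_date"] > latest[key]["batch_date"]:
        latest[key] = row

dwh_load = list(latest.values())
assert len(dwh_load) == 1  # one fact per natural key, no duplicates
```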
Alternative 2: apply insert/update logic at ODS level. With the previous example, the re-submitted row would update the existing ODS row in place (the batch date becomes 12/1/2022 16:34) rather than create a second row for ID001.
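One way to picture the ODS-level upsert is the sketch below, using an in-memory SQLite table whose schema and column names are purely illustrative; `INSERT ... ON CONFLICT DO UPDATE` is standard SQLite upsert syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ods (
        natural_key     TEXT,
        source_modified TEXT,
        batch_date      TEXT,
        extraction_date TEXT,
        PRIMARY KEY (natural_key, source_modified)
    )
""")

def upsert(row):
    # Insert; if the natural key + source modification timestamp already
    # exists, refresh the batch date instead of adding a second row.
    con.execute(
        """
        INSERT INTO ods VALUES (:natural_key, :source_modified,
                                :batch_date, :extraction_date)
        ON CONFLICT (natural_key, source_modified)
        DO UPDATE SET batch_date = excluded.batch_date
        """,
        row,
    )

first = {"natural_key": "ID001", "source_modified": "2022-01-10 10:43:00",
         "batch_date": "2022-01-10 01:00", "extraction_date": "2022-01-10"}
upsert(first)
upsert({**first, "batch_date": "2022-01-12 16:34"})  # the resubmitted file
print(con.execute("SELECT COUNT(*) FROM ods").fetchone())  # (1,)
```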
Alternative 3: purge from the ODS the previously loaded data, i.e. delete the rows whose extraction date falls within the re-submitted window (the last 15 days in our example), then insert the re-extracted files as straight inserts.
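A purge-and-reload against the same kind of illustrative ODS table as in the previous sketch could look like this (the connection, table, and column names are assumptions for the example; extraction dates are assumed to be ISO-formatted text so `BETWEEN` compares correctly):

```python
def reload_window(con, rows, window_start, window_end):
    # Purge everything previously loaded for the re-extracted window...
    con.execute(
        "DELETE FROM ods WHERE extraction_date BETWEEN ? AND ?",
        (window_start, window_end),
    )
    # ...so the resubmitted files can be straight inserts again.
    con.executemany(
        """
        INSERT INTO ods VALUES (:natural_key, :source_modified,
                                :batch_date, :extraction_date)
        """,
        rows,
    )
    con.commit()
```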
Alternative 1 is performant but shifts the insert/update task to the DWH level, so the performance killer is still there. Alternative 2 requires an insert/update, which for millions of rows does not seem optimal. Alternative 3 looks good, but it feels wrong to delete data from the ODS.
What is your view on this? In other words, my question is: how do you reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or fill in missing data?
There are two primary methods to load data into your data warehouse:
Full load: with a full load, the entire staged data set is dumped and then completely replaced with the new, updated data. No additional information, such as timestamps or technical audit columns, is needed.
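As a rough illustration, a full load reduces to "empty the staged table, then bulk insert the new extract"; this sketch uses an in-memory SQLite table whose name and columns are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (natural_key TEXT, payload TEXT)")

def full_load(rows):
    # Dump everything previously staged, then replace it entirely with
    # the new extract; no watermark or audit column is consulted.
    con.execute("DELETE FROM staging")
    con.executemany("INSERT INTO staging VALUES (:natural_key, :payload)", rows)
    con.commit()

full_load([{"natural_key": "ID001", "payload": "v1"}])
full_load([{"natural_key": "ID001", "payload": "v2"}])  # full replacement
print(con.execute("SELECT * FROM staging").fetchall())  # [('ID001', 'v2')]
```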
Incremental load / delta load: only the difference between the target and the source data is loaded through the ETL process into the data warehouse. There are two types of incremental loads, depending on the data volume: streaming incremental load and batch incremental load.
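A batch incremental load is typically driven by a watermark: only rows modified after the previous load's high-water mark are pulled. Here is a hedged sketch, where the `src` and `tgt` tables and their columns are assumptions for the example:

```python
def incremental_load(source_con, target_con, last_watermark):
    # Pull only the rows modified after the previous load's watermark.
    rows = source_con.execute(
        "SELECT natural_key, payload, source_modified FROM src "
        "WHERE source_modified > ?",
        (last_watermark,),
    ).fetchall()
    target_con.executemany("INSERT INTO tgt VALUES (?, ?, ?)", rows)
    target_con.commit()
    # The max modification timestamp seen becomes the next watermark.
    return max((r[2] for r in rows), default=last_watermark)
```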