etl, data-warehouse

What is the best way of handling data resubmission in a data warehouse?


Let's assume that we have a data warehouse composed of four components:

  • extract: source data is extracted from an Oracle database to a flat file, one flat file per source table. The extraction date is kept as part of the flat file name, and each record carries an insert/update date from the source system.
  • staging area: temporary tables used to load the extracted data into database tables.
  • operational data store: the staged data is loaded into the ODS. The ODS keeps the full history of all loaded data, and the data is typecast. Surrogate keys are not yet generated. (A minimal table sketch follows this list.)
  • data warehouse: data is loaded from the ODS, surrogate keys are generated, dimensions are historized, and finally fact data is loaded and attached to the proper dimensions.
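
For illustration, here is a minimal sketch of what one such ODS table could look like under this design; the table and column names (ods_customer, natural_key, src_modified_at, and so on) are assumptions made for the example, not part of the actual model:

    -- Hypothetical ODS table: insert-only, full history, typecast columns, no surrogate key yet.
    -- All object and column names are illustrative assumptions.
    CREATE TABLE ods_customer (
        natural_key       VARCHAR2(20)  NOT NULL,   -- business key coming from the source system
        src_modified_at   TIMESTAMP     NOT NULL,   -- insert/update date carried in the extract
        extraction_date   DATE          NOT NULL,   -- parsed from the flat file name
        batch_date        TIMESTAMP     NOT NULL,   -- when the ETL batch loaded the row
        customer_name     VARCHAR2(200),            -- typecast business attributes
        customer_segment  VARCHAR2(50)
    );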

So far so good; for regular delta loading I have no issue. However, the question I ask myself is this: I have regularly encountered situations in the past where, for whatever reason, you want to resubmit already-extracted data into the loading pipeline. Let's assume, for instance, that we select all the flat files extracted over the last 15 days and push them again through the ETL process.

  1. There is no new extraction from the source systems. Previously loaded files are reused and fed into the ETL process.
  2. Data is then reloaded into the staging tables, which have been truncated beforehand (sketched after this list).
  3. Now the data has to move to the ODS, and here I have a real headache about how to proceed.
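
A minimal sketch of steps 1 and 2, assuming an Oracle external table (ext_customer_file) sitting on top of the resubmitted flat file; the object names, and the hard-coded extraction date that would normally be parsed from the file name, are illustrative assumptions:

    -- Staging is emptied, then reloaded from the resubmitted extract.
    TRUNCATE TABLE stg_customer;

    INSERT INTO stg_customer (natural_key, src_modified_at, extraction_date,
                              customer_name, customer_segment)
    SELECT natural_key,
           TO_TIMESTAMP(src_modified_at, 'DD/MM/YYYY HH24:MI:SS'),
           DATE '2022-01-10',                   -- extraction date taken from the flat file name
           customer_name,
           customer_segment
      FROM ext_customer_file;                   -- external table over the reused flat file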

Alternative 1: just insert the new rows. So we would have:

  • row 2: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
  • previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
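
A sketch of this append-only ODS load, reusing the illustrative names from above; the batch date is simply stamped at load time:

    -- Alternative 1 sketched: straight append, resubmitted rows become additional versions in the ODS.
    INSERT INTO ods_customer (natural_key, src_modified_at, extraction_date, batch_date,
                              customer_name, customer_segment)
    SELECT natural_key, src_modified_at, extraction_date, SYSTIMESTAMP,
           customer_name, customer_segment
      FROM stg_customer;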

But then, when loading to the DWH, we need some kind of insert/update mechanism: we cannot do a straight insert, as it would create duplicate facts.
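
One possible shape of that mechanism (not necessarily the best one) is a MERGE on a business-level key when loading the fact table; dwh_fact_sales, ods_sales and the column names are assumptions, and the surrogate-key lookup against the dimension is left out for brevity:

    -- The insert/update step that alternative 1 pushes down to the DWH fact load.
    MERGE INTO dwh_fact_sales f
    USING (
            -- DISTINCT collapses the identical copies created by resubmitted batches
            SELECT DISTINCT natural_key, src_modified_at, amount
              FROM ods_sales
             WHERE extraction_date >= TRUNC(SYSDATE) - 15   -- the window being (re)loaded
          ) s
       ON (    f.source_natural_key = s.natural_key
           AND f.source_modified_at = s.src_modified_at)
    WHEN MATCHED THEN
         UPDATE SET f.amount = s.amount
    WHEN NOT MATCHED THEN
         INSERT (source_natural_key, source_modified_at, amount)
         VALUES (s.natural_key, s.src_modified_at, s.amount);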

Alternative 2: apply insert/update logic at the ODS level. With the previous example we would have:

  • check whether the ODS table already contains a row with natural key ID001, extraction date 10/1/2022, and source system modification timestamp 10/1/2022
  • insert it if not found (see the sketch below)
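
A minimal sketch of this check-then-insert at the ODS level, again with the assumed names; it relies on the combination of natural key, extraction date and source modification timestamp identifying one extracted row version:

    -- Alternative 2 sketched: only insert staged rows the ODS has not already absorbed.
    INSERT INTO ods_customer (natural_key, src_modified_at, extraction_date, batch_date,
                              customer_name, customer_segment)
    SELECT s.natural_key, s.src_modified_at, s.extraction_date, SYSTIMESTAMP,
           s.customer_name, s.customer_segment
      FROM stg_customer s
     WHERE NOT EXISTS (
             SELECT 1
               FROM ods_customer o
              WHERE o.natural_key     = s.natural_key
                AND o.extraction_date = s.extraction_date
                AND o.src_modified_at = s.src_modified_at
           );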

Alternative 3: purge the previously loaded data from the ODS, i.e.

  • purge all the data whose extraction date falls within the last 15 days
  • reload the data from staging (sketched below)
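
A sketch of the purge-and-reload variant, with the same assumed names and the 15-day window expressed relative to the current date:

    -- Alternative 3 sketched: delete the resubmitted window from the ODS, then append it again.
    DELETE FROM ods_customer
     WHERE extraction_date >= TRUNC(SYSDATE) - 15;   -- same window as the resubmitted flat files

    INSERT INTO ods_customer (natural_key, src_modified_at, extraction_date, batch_date,
                              customer_name, customer_segment)
    SELECT natural_key, src_modified_at, extraction_date, SYSTIMESTAMP,
           customer_name, customer_segment
      FROM stg_customer;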

Alternative 1 is performant but shifts the insert/update work to the DWH level, so the performance killer is still there. Alternative 2 requires an insert/update (upsert), which for millions of rows does not seem optimal. Alternative 3 looks good, but it feels wrong to delete data from the ODS.

What is your view on this? In other words, my question is how to reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or correct missing data.


Solution

  • There are two primary methods to load data into your data warehouse:

    Full load: with a full load, the entire staged data set is dropped and completely replaced with the new, updated data. No additional information, such as timestamps or technical audit columns, is needed.

    Incremental load / delta load: only the difference between the source and the target data is loaded through the ETL process into the data warehouse. There are two types of incremental loads, depending on the data volume: streaming incremental load and batch incremental load.
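
    A minimal sketch of a batch incremental load, assuming a control table (etl_watermark) that remembers the last successfully loaded source modification timestamp; all object names are illustrative, and a full load would instead be a plain truncate-and-reload of the target:

        -- Batch incremental load: only pick up rows modified since the last successful run.
        INSERT INTO dwh_target (natural_key, src_modified_at, some_attribute)
        SELECT s.natural_key, s.src_modified_at, s.some_attribute
          FROM ods_source s
         WHERE s.src_modified_at > (SELECT last_loaded_at
                                      FROM etl_watermark
                                     WHERE table_name = 'DWH_TARGET');

        -- After a successful load, move the watermark forward.
        UPDATE etl_watermark
           SET last_loaded_at = (SELECT MAX(src_modified_at) FROM ods_source)
         WHERE table_name = 'DWH_TARGET';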