Search code examples
etldata-warehousesql-data-warehouse

what are the steps I need to perform to clean the data if data into the dimension/fact table improperly loaded


Suppose there is a scenario where there is a data loading process into the fact table\dimensional table, and after analysis found that 100 millions records are being improperly loaded, what are the steps I need to perform to clean the data properly.


Solution

  • Here are two practices which help in that scenario:

    1. Take a backup or snapshot before each batch. In the case of a major error like this you can roll back to the snapshot, reload and process the correct data.

    2. Maintain an insert-only persistent staging area in the DW, such as a data vault, with each row stamped with a batch ID and timestamp. Remove the rows in error, and rebuild your facts and dimensions.

    If this represents a real situation your only chance is #1.

    If you don't have a reliable backup, and you have updated and/or deleted rows during the ETL/ELT process, you don't have any record of the pre-fail state and it may be impossible to go back.