oracle database-design etl scd

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model


I'm an ETL developer who is currently tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that tracks future changes to the data, but I'm completely useless when it comes to the task at hand.

The relational model is in our ODS. Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes that need to be monitored for changes, each in a specific related table in the relational model. Historical changes must be kept at daily granularity: if multiple changes to the same attribute occur on the same day, only the last one is kept.

How can I tackle this? I'm lost. Thanks in advance.

P.S. We're talking about tables with 20-30 million rows and multiple attributes that may change at any given time, and every such change must result in a new record in the SCD.


Solution

  • This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).

    If you have a table 'Table1' containing 'Key', 'Attribute1', 'Effective From' and 'Effective To' columns, then you can 'explode' that table into a virtual table of the form 'Key', 'Attribute1', 'Date', projecting out one row for every date on which that attribute value was current.

    (Note that you probably don't want to do this as a ranged join against your date dimension, because that is a triangular join, i.e. it will perform really badly. You will probably need to explode the rows in an ETL tool or programmatically.)

    If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.

    Then, of course, you need to run this through another process to collapse rows with the same Key, contiguous dates and all the same attribute values, i.e. 'unexplode' the rows back into 'effective from', 'effective to' form. Note again that this is fundamentally a row-by-row operation (or at the very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all through some .NET/Java code to achieve this.

    Given data volumes this will take a while, but should be achievable.
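
The three steps described above (explode, join by key and date, collapse) can be sketched as follows. This is only a minimal in-memory illustration in Python, not a production implementation: the function names, the sample `status` and `region` attributes, the inclusive 'Effective To' date, and the `as_of` cut-off used for open-ended rows are all assumptions made for the example. At 20-30 million rows you would stream the data one key at a time rather than hold every daily row in a dictionary.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def explode(rows, as_of):
    """Turn (key, value, effective_from, effective_to) rows into {(key, day): value}.

    Assumptions: effective_to is inclusive, None means open-ended (capped at
    as_of), and rows arrive in change order so a later row overwrites an
    earlier one for the same key and day.
    """
    daily = {}
    for key, value, eff_from, eff_to in rows:
        last = min(eff_to or as_of, as_of)
        day = eff_from
        while day <= last:
            daily[(key, day)] = value
            day += ONE_DAY
    return daily

def join_daily(**attr_tables):
    """Join exploded tables on (key, day), yielding rows sorted by key, then day.

    An attribute with no value on a given day comes out as None.
    """
    names = list(attr_tables)
    for key, day in sorted(set().union(*attr_tables.values())):
        yield key, day, tuple(attr_tables[n].get((key, day)) for n in names)

def collapse(snapshots):
    """Merge consecutive days with identical values back into date ranges.

    Expects (key, day, values) sorted by key, then day. Yields
    (key, effective_from, effective_to, values).
    """
    current = None  # [key, effective_from, effective_to, values]
    for key, day, values in snapshots:
        if (current and current[0] == key and current[3] == values
                and day == current[2] + ONE_DAY):
            current[2] = day  # same values on the next day: extend the range
        else:
            if current:
                yield tuple(current)
            current = [key, day, day, values]
    if current:
        yield tuple(current)

# Hypothetical sample data: two attributes of key 1 that change independently.
as_of = date(2020, 1, 5)
status = [(1, "A", date(2020, 1, 1), date(2020, 1, 3)),
          (1, "B", date(2020, 1, 4), None)]
region = [(1, "X", date(2020, 1, 1), date(2020, 1, 2)),
          (1, "Y", date(2020, 1, 3), None)]

scd = list(collapse(join_daily(status=explode(status, as_of),
                               region=explode(region, as_of))))
for row in scd:
    print(row)
```

With this sample data the result has three ranges: (A, X) for 1-2 January, (A, Y) for 3 January only, and (B, Y) for 4-5 January. In a real load the last range per key would normally be left open-ended instead of being closed at `as_of`.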