Search code examples
data-vault

Datavault: How to get hashes for foreign key relationships (populating link tables)


I've read the data vault book end to end, but I'm still trying to resolve one specific thing related to how you'd populate the link tables (how to get all hashes for that). From the blog of scalefree: massively parallel processing, it demonstrates that satellites and hubs can be loaded in full parallel fashion, but it doesn't go into a lot of detail related to the link tables.

Links require hash keys, thus in some way 'business keys' from multiple tables to establish the relationships, that's what they do, they record relations between hubs. There aren't very good explanations or in-depth explanations how you would retrieve the business keys of related entities when populating these link tables.

For a specific table like 'customer' things are easy for hub and satellite: just convert the business key to a hash and load both of them in parallel.

But a customer details table or a transaction table from an OLTP need some kind of join to happen to look up the business key for the customer or to look up all the related entities in the transaction (product, customer, store, etc), because those tables do not typically store (all) business key(s) as an attribute in the table.

If I assume that staging is loaded incrementally and truncated, then staging doesn't necessarily have all the entities loaded to be able to perform joins there. How to resolve this dilemma and create a design that works?

  1. Join on tables in the source OLTP systems to generate the business keys from there and propagate them as hashes from there? (this ends up wrong if the business key was chosen incorrectly)
  2. Use a persistent staging area, so never truncate? (then it's always possible to join on any table in there to resolve)
  3. Use some kind of index for surrogate keys -> business keys and perform a lookup from there? (minimizes I/O a bit further and is a mix between incremental staging and persistent staging).
  4. some other method...?

Essentially, what is the best practice for generating the hashes for all foreign key relations of your OLTP systems?


Solution

  • I talked to an expert about this and this is the answer I accepted from him:

    The only sensible two ways to produce hashes for tables that do not have all the columns necessary to produce a business key for that table is:

    • In the case where you have a full load of all the tables that have the business keys (yet maybe incremental for a link table), join to the relevant source tables having the business keys in staging. This is ok, because you can guarantee you have all the data in staging at that moment.
      • In the case where you have incremental loads for tables having business keys, you must use a persistent staging area (PSA) to do this for you.

    It is considered bad practice to join tables in source system queries in order to generate the business keys. The reason is that the data warehouse should have as little operational impact as possible.