Search code examples
hivedata-warehousedimensional-modeling

Is it better to have a surrogate key or nk+effective_time in dimension tables in apache hive


Lets say, there is a SCD2 dimension table - location. The natural key is country, state and city combined. Since it is SCD2 table, eff date is also part of the key.

Is it better to have the surrogate key as usavirginarichmond20110101 or create an actual numerical key using row_number() in hive?

Why one approach is better over another?


Solution

  • (Note on terminology: combination of natural keys is called "composite key", not surrogate key, and it's still a "natural key". Surrogate key (aka Synthetic key) is a sequential integer that has no business meaning).

    Short answer: since your dimension is SCD2, definitely use surrogate/synthetic keys. Handling SCD with natural/composite keys is a pain.

    Longer answer: Surrogate (SK) vs Natural keys (NK) design is an on-going debate. Each has pros and cons. My approach is to always use surrogate keys in data warehouse (DW). It means some extra ETL work, but that's an acceptable cost because surrogate keys have some important advantages:

    1. SCD handling is much easier. If you have SCDs, using natural keys is rather cumbersome and ugly. Synthetic keys don't have the problem;

    2. System-wide consistency: because of SCD, it's highly likely that you will have to use SKs in your Data Warehouse at least in some tables. It makes sense then to consistently use them in all tables. Mixing SK and NK designs is ugly;

    3. Composite NKs can often be large and complex alpha-numeric strings. It means that they might substantially increase table sizes, and joins might be slower. SK is a simple integer, with predictable size and consistent join speed;

    4. NKs can be a source of bugs and instability in DW. For example, some databases re-use their natural keys, and as a result their meaning might change over time. In DW that relies on NKs that's a potential disaster. Also, NKs might come from a wide variety of sources, and lead to integration conflicts.

    There are other considerations, but in my experience, systematically using surrogate keys makes DW design more reliable and efficient.