Search code examples
data-modelingdata-warehousedimensional-modelingdata-vault

Why using sequence number against version number in dimension table by datawarehouse modeling


In the context of dimension modeling, as typical case, it is great to have surrogate key in dimension table to track the change of rows (http://www.kimballgroup.com/2006/07/design-tip-81-fact-table-surrogate-key/).

There are three common ways to realize surrogate key 1) sequence number 2) version number 3) hash key (used by data vault)

My question is: why sequence number is preferred in most of dimension modeling I have seen.


Solution

  • I think there are a couple of reasons why sequence number is generally used, but I don't think it's a clearly superior way of doing things in all situations.

    Sequence Number

    Pros

    • Sequence numbers are easy. They're so ridiculously easy that for most purposes it's a waste of time considering anything else. Don't let anyone tell you that this isn't the reason we use it.
    • Sequence numbers are guaranteed unique.
    • Sequence numbers are as small (narrow) as you can get.
    • Sequence numbers don't encode any information, so it doesn't matter if you change the contents or even grain of the dimension, as long as the facts know about it. This is important because the grain of dimensions can easily change and as such you should not be using the surrogate key with meaningful data (that's kinda the point of the surrogate key, at least in Kimball-ian DWs)

    Cons

    • The sequence number is essentially a waste of space - if you could encode information in this, even by making it a larger column, you can save space. See Pros above though...
    • I remember seeing some posts about sequence numbers sometimes giving poor write performance because of page locking, but I can't find it now. This may cause slow loading.

    Version Number

    I haven't seen examples of this before, and googling about it seems to turn up this question and some references to appending it to existing fields, so I'm going to assume that you're talking about appending the version to a sequence or hash, or some other identifier.

    Pros

    • You get access to the version number of the data
    • This may be a way of unique-ifying a natural key so that you can use it as the DW dimension key

    Cons

    • Biggest con is that you can't access this data without trimming it out of the key. Why not just have it as a separate column?
    • Natural keys are generally bad practice in a DW, so if that's your motivation you might want to rethink your approach.

    Hash

    If you're not going to use a sequence number, this is probably my preferred option. Needs some pretty specific circumstances though I think

    Pros

    • Great for type 2 slowly changing dimensions - you don't have to store the hash in a separate column, so it saves space
    • One of the few times that coding information into the surrogate key doesn't mean stabbing yourself in the foot for future development.

      Cons

    • If you're using type 1 slowly changing dimensions you've just stabbed yourself in the foot. Updated an attribute? Try updating the primary key without dropping half the database and see how far you get.

    • It's big. This makes your fact tables big, and that makes your database big. If you're using column-based compression this ironically becomes a bigger problem the bigger the dimension gets (to a point...)

    Conclusion

    So it depends on your situation, but sequence number is just so easy to implement and the cons are almost completely negligible in almost all situations, to the point where it sits as a comfortable default. Therefore choosing another option usually falls into the "you have to explain why you did it" category.