Search code examples
sqlsql-serverdata-warehousebusiness-intelligencedimensional-modeling

Put hierarchy in its own separate dimension or set it as a part of related dimension?


I'm new to the dimensional model in dataware house design, and I face some confusion in my first design.

I take a simple business process (Vacation Request), and I want to ask which of these two designs is accurate and near to be effective and applicable, I will be grateful if I get a detailed answer please? (My question about the dimension design mainly)

1- 

Dimension.Employee           Fact.Vacation
[Employee Key]               [Employee Key] FK PK
[_source key]                [Vacation Transaction]PK DD
[Employee Name]              ...
....                         ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
.... 

2- 

Dimension.Employee           Dimension.Section         Fact.Vacation
[Employee Key]               [Section Key]             [Employee Key] FK PK
[_source key]                [_source key]             [Vacation Transaction]PK DD
[Employee Name]              [Department Code]         [Section Key]FK
....                         [Department Name]         ...
....                         [Campus Code]
                             [Campus Name]

Where the hierarchy is like this:

Campus Contains --> 
Many Departments and each department contains --> 
many sections and each section contains many employees

Solution

  • Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...

    • WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).

    The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.

    So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.

    • FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.

    • CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.

    • DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up having to constantly be modifying the employee record in your employee dimension - this leads to a longer than needed dimension table if this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.

    • DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your fact table can relate. In this case you'll need a section table. Point for putting them into separate dimensions.

    • WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.

    Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!