Search code examples
databasedata-warehousefact-table

Data Warehouse - How to store created_time, created_date, complete_time, complete_date


My task table has 4 columns to store created_time, created_date, completed_time, completed_date.

When I convert that table to OLAP, Do I want to store them under Date Time dimension or is it ok to keep them in Fact table.

Can someone please explain. Thank you.


Solution

  • Assuming you are using a star-schema, a date dimension usually acts more than just a lookup table. It usually contains a good number of columns describing the specific date in the fact table, such as is it a holiday, which quarter is it on, which fiscal quarter is it on, etc.

    Constructed in this way, business can ask questions such as how many tasks were completed in 1st quarter (without having to enter exact start and end dates for that 1st quarter).

    The answer to your question depends on the type of queries you expect the user to ask you. If a query such as the above is likely, then yes, create a comprehensive date dimension to store dates information.

    Of course this makes your queries use FKs (or pointer columns to date dimension) and will make you use joins. Joins could slow performance slightly for very large tables. However, the star schema is based on this concept.

    Date dimension has to be initialized with some data rows usually covering 1 or 2 years besides the current year (or maybe more).

    Now, we talk about time columns. It is not recommended to build time in the date dimension (see link). If you build time in the date dimension, the date dimension will be needlessly huge.

    I recommend that you place the time columns in the fact table only, whether you use a time dimension or not. I also recommend that you include calculated columns in the fact such as total duration in days, months, years and hours in the fact table (assuming this information serve queries such as how many tasks took 5 hours to finish). You need to make the calculations during ETL. You can't just subtract end-time from start-time without having the dates. You also don't want to get into such calculations during query time, otherwise the queries would be complex.

    This type of denormalization may be acceptable by many within the star schema model and has a minor disadvantage of making the fact longer. There are ways to make calculated columns virtual, but you may decide to persist the calculated columns. In such a case, If your fact is long and you have huge number of fact tables, you may decide to create a special fact table that is associated in 1-1 relationship with the main fact to make processing faster, that new fact will be smaller, and faster to load. However, this is likely not to be the case in many applications, that is 1 fact will do the job just fine.

    This may also help: Kimball-Latest Thinking On Time Dimension Tables.