Search code examples
data-warehouse

In data warehouse, can fact table contain two same records?


If a user ordered same product with two different order_id; The orders are created within a same date-hour granularity, for example

order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21

In the data warehouse, should we put them into two rows like this (Option 1):

| id  | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111     | 22          | 123      | 456      | 10    | 1        |
| 002 | 1111     | 22          | 123      | 456      | 10    | 2        |

Or just put them in one row with the aggregated quantity (Option 2):

| id  | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111     | 22          | 123      | 456      | 10    | 3        |

I know if I put the order_id as a degenerate dimension in the fact table, it should be Option 1. But in our case, we don't really want to keep the order_id.

Also I once read an article that says that when all dimensions are filtered out, there should be only one row of data in the fact table. If this statement is correct, the Option 2 will be the choice.

Is there a principle where I can refer ?


Solution

  • Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true - if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.

    I would recommend the following approach: in a data warehouse, keep order number as degenerate dimension. Then, when you publish a star schema, you might build a pre-aggregated version of the table (skip order number, group identical records by date/hour). This way, you can have smaller/cleaner fact table in your dimensional model, and yet preserve more detailed data in the DW.