Search code examples
databasedatabase-designdata-warehouse

Data warehouse design - periodic snapshot with frequently changing dimension keys


Imagine a fact table with a summation of measures over a time period, say 1 hour.

Start Date          | Measure 1 | Measure 2
-------------------------------------------
2018-09-08 00:00:00 | 5         | 10
2018-09-08 00:01:00 | 12        | 20

Ideally we want to maintain the grain such that each row is exactly 1 hour. However, each row references dimensions which might ‘break’ the grain. For instance:

Start Date          | Measure 1 | Measure 2 | Dim 1
---------------------------------------------------
2018-09-08 00:00:00 | 5         | 10        | key 1
2018-09-08 00:01:00 | 12        | 20        | key 2

It is possible that the dimension value may change 30 minutes into the hour in which case, the above would be inaccurate and should be represented like this:

Start Date          | Measure 1 | Measure 2 | Dim 1
---------------------------------------------------
2018-09-08 00:00:00 | 5         | 10        | val 1
2018-09-08 00:00:30 | 5         | 10        | val 2
2018-09-08 00:01:00 | 12        | 20        | val 2

In our scenario, the data needs to be sliced by at least 5 dimension keys with queries like:

sum(measure1) where dim1 = x and dim2 = y..

Is there a design pattern for this requirement? I have considered ‘periodic snapshots’ but I have not read anywhere about this kind of row splitting on dimension changes.

I can see only two options:

  1. Store the dimension values that were most present on each row (e.g. if a dimension value was true for the majority of the time in the hour, use this value). This would lead to some loss of accuracy.
  2. Split each row on every dimension change. This is complex in the ETL, creates more data and breaks the granularity rule in the fact table.

Option 2 is the current solution and serves the purpose but is harder to maintain. Is there a better way to do this, or other options?

By way of a real example, this system records production data in a manufacturing environment so the data is something like:

Line   | Date                | Crew   | Product   | Running Time (mins)
-----------------------------------------------------------------------
Line 1 | 2018-09-08 00:00:00 | Crew A | Product A | 60

As noted, the crew, product or any of the other dimension may change multiple times within the hour.


Solution

  • You shouldn't need to split the time portion of your fact table since you clearly want to report hourly data, but you should have two records, one for each dimension value. If this is an aggregate of a transactional fact table, your process that loads the hourly table should be grouping each record by each dimension key. So in your example above, you should have two records for hour like so:

    Start Date          | Measure 1 | Measure 2 | Dim 1
    ---------------------------------------------------
    2018-09-08 00:00:00 | 5         | 10        | val 1
    2018-09-08 00:01:00 | 5         | 10        | val 1
    2018-09-08 00:01:00 | 12        | 10        | val 2
    

    You will need to take into account the other measures as well and make sure they all go into the correct bucket (val 1 or val 2). I split them evenly in the example.

    Now if you slice by hour 1 and by Dim 1 Value 2, you will only see 12 (measure 1), and if you slice on hour 1, dim 1 value 1, you will only see 5, and if you only slice on hour 1, you will see 17.

    Remember, your grain is defined by the level of each dimension, not just the time dimension. HTH.