Tags: data-warehouse, etl, date-range, star-schema, snowflake-schema

Data Warehouse: Working with accumulated data


Our data warehouse builds a snowflake schema from a data source that delivers only accumulated data (and there is no way to invert the accumulation). One requirement we have to meet is that the schema must support reports based on date ranges.

Our schema looks like this (simplified):

+------------------------------------------+
| fact                                     |
+-------+-----------------+----------------+
|    id | statisticsDimId | dateRangeDimId |
+-------+-----------------+----------------+
|     1 |               1 |             10 |
|     2 |               2 |             11 |
|     3 |               3 |             12 |
|     4 |               4 |             13 |
|     5 |               5 |             14 |
|     6 |               5 |             15 |
|     7 |               5 |             16 |
|   ... |             ... |            ... |
| 10001 |            9908 |             11 |
| 10002 |            9909 |             11 |
+-------+-----------------+----------------+

+-------------------------------------------------+
| date_range_dimension                            |
+-------+-----------------------------------------+
|    id | startDateTime      | endDateTime        |
+-------+--------------------+--------------------+
|    10 | '2012-01-01 00:00' | '2012-01-01 23:59' |
|    11 | '2012-01-01 00:00' | '2012-01-02 23:59' |
|    12 | '2012-01-01 00:00' | '2012-01-03 23:59' |
|    13 | '2012-01-01 00:00' | '2012-01-04 23:59' |
|    14 | '2012-01-01 00:00' | '2012-01-05 23:59' |
|    15 | '2012-01-01 00:00' | '2012-01-06 23:59' |
|    16 | '2012-01-01 00:00' | '2012-01-07 23:59' |
|    17 | '2012-01-01 00:00' | '2012-01-08 23:59' |
|    18 | '2012-01-01 00:00' | '2012-01-09 23:59' |
|   ... |                ... |                ... |
+-------+--------------------+--------------------+

+-----------------------------------------------------+
| statistics_dimension                                |
+-------+-------------------+-------------------+-----+
|    id | accumulatedValue1 | accumulatedValue2 | ... |
+-------+-------------------+-------------------+-----+
|     1 |    [not relevant] |    [not relevant] | ... |
|     2 |    [not relevant] |    [not relevant] | ... |
|     3 |    [not relevant] |    [not relevant] | ... |
|     4 |    [not relevant] |    [not relevant] | ... |
|     5 |    [not relevant] |    [not relevant] | ... |
|     6 |    [not relevant] |    [not relevant] | ... |
|     7 |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
| 10001 |    [not relevant] |    [not relevant] | ... |
| 10002 |    [not relevant] |    [not relevant] | ... |
+-------+-------------------+-------------------+-----+

We want to create our report data set with something like this:

SELECT *
    FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = [start]
AND
    date_range_dimension.endDateTime = [end]

The problem is that the data in our statistics dimension is already accumulated, and we cannot invert the accumulation. We estimated the number of rows our fact table would need and arrived at 5,250,137,022,180. There are about 2.5 million date-range permutations for our data, and because of the accumulation each of them would have to be materialized in the date dimension and the fact table. SQL's SUM function does not help us here: since the ranges overlap, we would be adding values that belong to non-disjoint sets.
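To illustrate why SUM fails (the numbers below are made up; only the range layout matches our sample data), here is a minimal sketch of what happens when pre-accumulated snapshots for overlapping ranges are added together:

```python
# Hypothetical cumulative page-call counts, one snapshot per date range.
# Every range starts at the accumulation origin (2012-01-01), as in the
# date_range_dimension sample above, so each value already includes all
# activity of the shorter ranges.
snapshots = {
    ("2012-01-01", "2012-01-01"): 100,  # calls up to end of Jan 1
    ("2012-01-01", "2012-01-02"): 180,  # calls up to end of Jan 2
    ("2012-01-01", "2012-01-03"): 240,  # calls up to end of Jan 3
}

# A naive SUM over the fact rows counts Jan 1 three times and Jan 2 twice:
naive_total = sum(snapshots.values())  # 520, not 240

# The right answer for the range Jan 1 - Jan 3 is simply the snapshot
# taken at the range's end; no addition across rows is valid:
correct_total = snapshots[("2012-01-01", "2012-01-03")]  # 240
```

If the accumulation were invertible, one could recover daily deltas (100, 80, 60 here) and store those as additive measures instead, but for non-additive values such as a completion status or a maximum progress percentage no such delta exists.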

Is there a best practice we could follow to make it computationally feasible? Is there something wrong with our schema design?

We need to report data about online trainings. The data source is a legacy data provider, parts of which are more than 10 years old, so nobody can reconstruct its internal logic. The statistics dimension contains, for example, the progress (in %) a user accomplished in a web-based training (WBT), the number of calls per WBT page, the status of a WBT for a user (e.g. "completed"), and so on. The important thing about the data provider is that it only gives us a snapshot of the current state. We have no access to historic data.


Solution

  • I'm assuming you are on some pretty strong hardware for this. Your design has one major drawback - the join between the fact table and the "statistics" dimension.

    Generally, a fact table contains dimensions and measures. It looks likely that there is a one-to-one relationship between your "statistics" dimension and your fact table. Since a fact table is essentially a many-to-many relationship table, it doesn't make sense to keep your stats in a separate table. In addition, you say the stats table holds information "by user".

    Any time you say "By X" in warehousing, you can almost always be sure that X should be a dimension.

    I would see about building your fact table with the measures directly on it. I'm not sure what you're trying to do with "inverting" the accumulation on the stats table? Do you mean it is accumulated across date ranges? Users? If the data's not atomic, the best you can do is give what you have...
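To make the suggestion concrete, here is a sketch of a fact table with the measures directly on it, at a daily periodic-snapshot grain (all table and column names are illustrative, and the sample values are invented). Assuming, as in the sample data, that every reporting range starts at the accumulation origin, a range report then only needs the one snapshot row taken at the range's end, so the fact table grows with users x WBTs x days rather than with range permutations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Daily-snapshot grain: one row per user/WBT per snapshot date, with the
# (still cumulative) measures stored directly on the fact table.
cur.executescript("""
CREATE TABLE date_dim (id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_wbt_snapshot (
    dateDimId   INTEGER REFERENCES date_dim(id),
    userId      INTEGER,
    wbtId       INTEGER,
    progressPct INTEGER,  -- cumulative, as delivered by the source
    pageCalls   INTEGER   -- cumulative call count
);
INSERT INTO date_dim VALUES (1,'2012-01-01'),(2,'2012-01-02'),(3,'2012-01-03');
INSERT INTO fact_wbt_snapshot VALUES
    (1, 42, 7, 10, 100),
    (2, 42, 7, 35, 180),
    (3, 42, 7, 60, 240);
""")

# Report for the range 2012-01-01 .. 2012-01-03: because every value
# accumulates from the origin, only the snapshot at the end date matters.
cur.execute("""
    SELECT f.progressPct, f.pageCalls
    FROM fact_wbt_snapshot f
    JOIN date_dim d ON d.id = f.dateDimId
    WHERE d.day = '2012-01-03'
""")
row = cur.fetchone()
print(row)  # (60, 240)
```

If ranges with arbitrary start dates really are required, the non-invertible accumulation means they cannot be derived from snapshots at all, which is worth confirming with the report consumers before materializing 2.5 million range permutations.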