Search code examples
data-warehouse

Data warehouse schema: is it OK to directly link fact tables in DWH?


Is it OK to directly link fact tables in DWH?

As I understand, in galaxy schema fact tables are not linked, they just have common dimension table. But, if there is a DWH schema that assumes to link them directly?


Solution

  • IMO, they shouldn’t, even if they can. Fact tables are usually huge, with potentially many billions of rows, and hold measures at a certain grain.

    Linking two or more fact tables may require joining several multi billion row tables which will be too expensive.

    If you need to link facts in different fact tables (all dimensions are common) you’re better off doing the join only once, storing the results and using that resulting table instead. Even better if this can be done at ETL level, where you can join batch by batch.

    If you join facts in two tables where one’s dimensions are a superset of the other’s, you’re better off aggregating the most granular facts to the other’s granularity and apply the solution above.

    If neither set of dimensions is a superset of another then you may need to aggregate both at a common level.

    The reason behind my position is that I’d rather have redundancy in storage and avoid query time calculations than have my users wait a long time for those joins to produce results. Also, very large joins need a lot memory which is normally more expensive than storage.

    Finally, remember a DWH normally has data loaded by ETL processes. They run in batch and can check for consistency at each run, unlike OLTP where avoiding multiple writes of same data is paramount to prevent inconsistency.

    Opinions on this differ and you’ll most likely get different views on the matter. In the end, both approaches have their pros and cons, study both and pick the one you’re most comfortable with.