amazon-redshift rdbms data-warehouse dimensional-modeling star-schema

Star Schema horizontal scaling

AFAIK, for a relational database on MPP hardware, the key to performance is correct data distribution. Dimensional modeling, however, is about query flexibility: you don't know in advance how the data will be queried (and therefore shuffled between nodes).

For example, take an MPP data warehouse (Greenplum, Redshift, Synapse Analytics). Suppose that in 1-2 years you expect the fact table to grow to 10 billion rows, with 15-30 dimension tables of tens of millions of rows each. How should the data be distributed across the DW nodes? Are there any common techniques, like sharding the fact table and replicating the dimension tables? Or should I minimize the number of nodes in the MPP DW?

I can provide a specific use case, but I believe the question arises from my misunderstanding of how dimensional modeling can be paired with scaling out.

Solution

One technique I’ve seen applied with success in the past: segment the fact table across the nodes (e.g., by mod’ing the date key), and keep a full copy of every dimension on every node. That way all joins can be done locally, with no data shuffled between nodes.
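
To make the idea concrete, here is a toy sketch in plain Python rather than any particular warehouse's DDL. It assumes that each node holds a full copy of the dimension and that the fact rows are assigned by `date_key % num_nodes`. The node count, table contents, and the names `node_for`, `distribute`, and `local_join` are all made up for illustration.

```python
NUM_NODES = 4  # hypothetical cluster size


def node_for(date_key, num_nodes=NUM_NODES):
    """Assign a fact row to a node by taking its date key modulo the node count."""
    return date_key % num_nodes


def distribute(facts, dimension, num_nodes=NUM_NODES):
    """Shard the fact rows across nodes; give every node its own full dimension copy."""
    nodes = [{"facts": [], "dim": dict(dimension)} for _ in range(num_nodes)]
    for row in facts:
        nodes[node_for(row["date_key"], num_nodes)]["facts"].append(row)
    return nodes


def local_join(node):
    """Join a node's fact slice to its local dimension copy; no other node is consulted."""
    return [
        {**row, "product_name": node["dim"][row["product_id"]]}
        for row in node["facts"]
    ]


# Made-up sample data: a tiny product dimension and a handful of fact rows.
dimension = {1: "widget", 2: "gadget"}
facts = [
    {"date_key": 20240101, "product_id": 1, "amount": 10},
    {"date_key": 20240102, "product_id": 2, "amount": 20},
    {"date_key": 20240103, "product_id": 1, "amount": 30},
    {"date_key": 20240104, "product_id": 2, "amount": 40},
    {"date_key": 20240105, "product_id": 1, "amount": 50},
]

nodes = distribute(facts, dimension)
joined = [row for node in nodes for row in local_join(node)]
```

Each fact row lands on exactly one node, and every lookup in `local_join` is served from that node's own dimension copy, which is the property that lets the joins run without moving data.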

Note that even with large dimensions, their total size on disk should be a small fraction of the total needed for the fact table.
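
A rough way to check this for your own numbers is a back-of-the-envelope calculation like the one below. The row counts are in the range given in the question, but the row widths, dimension count, and node count are pure assumptions, and the function name `dimension_to_fact_ratio` is hypothetical. Real on-disk sizes also depend on compression and column encoding, which this ignores.

```python
def dimension_to_fact_ratio(fact_rows, fact_row_bytes,
                            dim_tables, dim_rows, dim_row_bytes,
                            num_nodes):
    """Return (single-copy ratio, fully replicated ratio) of dimension size to fact size."""
    fact_bytes = fact_rows * fact_row_bytes
    one_copy_bytes = dim_tables * dim_rows * dim_row_bytes
    # One copy of the dimensions per node when they are replicated everywhere.
    replicated_bytes = one_copy_bytes * num_nodes
    return one_copy_bytes / fact_bytes, replicated_bytes / fact_bytes


# Assumed inputs: 10 billion fact rows, 20 dimensions of 20 million rows each,
# 100 bytes per row for both (uncompressed, made up), and 8 nodes.
single, replicated = dimension_to_fact_ratio(
    fact_rows=10_000_000_000, fact_row_bytes=100,
    dim_tables=20, dim_rows=20_000_000, dim_row_bytes=100,
    num_nodes=8,
)
print(f"one copy: {single:.0%} of fact size, replicated: {replicated:.0%}")
```

With these assumed inputs, one copy of all dimensions is about 4% of the fact table. Since the replicated total grows with the node count (about 32% here), it is worth rerunning the calculation with your actual widths and cluster size.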