apache-spark, bigdata, databricks, azure-databricks, databricks-sql

Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?


The more I read about the Lakehouse architectural pattern, and the more I follow the demos from Databricks, the less discussion I see of dimensional modelling as practised in a traditional data warehouse (the Kimball approach). I understand that compute and storage are much cheaper now, but are there larger impacts on query performance if the data isn't modelled? From Spark 3.0 onwards there are features like Adaptive Query Execution and Dynamic Partition Pruning, but do they make dimensional modelling obsolete? If anyone has implemented dimensional modelling on Databricks, please share your thoughts.


Solution

  • Kimball-style star schemas and Data Vault modeling techniques are still relevant for the Lakehouse pattern. The optimizations you mention, such as Adaptive Query Execution and Dynamic Partition Pruning, combined with data skipping, Z-Ordering, bloom filter indexes, etc., make queries against such models very efficient.

    In fact, Databricks data warehousing specialists recently published two related blog posts on exactly this topic.
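As a sketch of how those optimizations pair with a dimensional model, the Databricks SQL below lays out a minimal star schema (the table and column names are hypothetical, chosen only for illustration). The fact table is partitioned on the surrogate key of the date dimension, which is the classic setup where Dynamic Partition Pruning helps:

```sql
-- Hypothetical date dimension and fact table for a star schema.
CREATE TABLE dim_date (
  date_sk       INT,
  calendar_date DATE,
  year          INT
) USING DELTA;

CREATE TABLE fact_sales (
  date_sk     INT,
  customer_sk BIGINT,
  amount      DECIMAL(18, 2)
) USING DELTA
PARTITIONED BY (date_sk);

-- Z-Order the fact table on a frequently joined/filtered column so
-- Delta data skipping can prune files within partitions as well.
OPTIMIZE fact_sales ZORDER BY (customer_sk);

-- A typical star-schema query: the filter sits on the dimension, and
-- Dynamic Partition Pruning derives the matching fact partitions
-- (date_sk values) at runtime instead of scanning the whole table.
SELECT d.year, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.year = 2023
GROUP BY d.year;
```

The point is that these engine features reward, rather than replace, a dimensional layout: pruning and skipping work best when facts are keyed and partitioned against conformed dimensions.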