reporting, data-modeling, data-warehouse, olap, dimensional-modeling

Dimensional modeling on columnar databases


I have started learning cloud architecture and found that cloud platforms generally use columnar databases, which claim to be more efficient because they store data by column rather than by row, reducing duplication.

From a data mart perspective (let's say one department in an organization only wants to monitor internet sales growth while another wants to focus on outlet performance), how can I design an architecture that can handle the data load and provide easy data access? I know how a data mart can easily be designed on top of a traditional warehouse so that the end user doesn't have to bother with calculations at all.

I have experience with SSAS (OLAP), where all calculations over a large data warehouse are precomputed and a normal business user can connect directly to the cube and analyze the data with a self-service BI tool (as simple as drag and drop). Columnar databases, on the other hand, seem to follow an ELT approach and leave all computation to either queries (views) or the reporting tool.

As I have experience with SQL Server, I presume that my query (for example, below)

SELECT 
  Region,
  State,
  City,
  Country,
  SUM(Sales_Amount),
  AVG(Discount_Sale),
  SUM(xyz)
  ....
FROM Columnar_DataTable
GROUP BY Region, State, City, Country

is going to scan the complete table, which can increase cost. Imagine the above query being executed more than 1,000 times a day in a large enterprise.

So, is it appropriate to create an OLAP layer with dimensional modeling on top of a columnar database, or is it better to load the data first and then filter/transform it in the reporting tool? Consider that most self-service BI tools already have this in mind, limit data consumption (e.g., the free Power BI Desktop edition allows 10 GB per dataset), and force users to do their own calculations.

  • If we segregate the data into multiple tables, then any reporting tool still needs the relationships between the tables for filtering.

  • If we keep a single-table format, then the reporting tool has to read all the data before making any calculation.


Solution

  • Business analysis queries often involve computing aggregations for metrics, like the total sales amount and the average discount in your example.

    OLAP data structures are useful here because the aggregations can be precomputed and stored, requiring less computation and I/O at query time and speeding up these query patterns.
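    As a minimal sketch of what precomputing could look like (the table and column names follow the question's example; CREATE TABLE ... AS SELECT is supported by most columnar warehouses, while SQL Server would use SELECT ... INTO or an indexed view instead):

    -- Precompute the aggregates once (e.g. on a schedule) ...
    CREATE TABLE Sales_Agg_By_Geo AS
    SELECT
      Region,
      State,
      City,
      Country,
      SUM(Sales_Amount)  AS Total_Sales,
      AVG(Discount_Sale) AS Avg_Discount
    FROM Columnar_DataTable
    GROUP BY Region, State, City, Country;

    -- ... and let reports hit the much smaller aggregate table.
    -- Only additive measures (sums, counts) can safely be re-aggregated;
    -- the stored average should not be averaged again.
    SELECT Region, SUM(Total_Sales) AS Region_Sales
    FROM Sales_Agg_By_Geo
    GROUP BY Region;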

    The OLAP approach gained momentum partly because typical relational databases were less performant in these scenarios, and OLAP turned out to be an effective optimization.

    The columnar database approach (in analytics-oriented databases) is also meant to optimize these use cases, mostly by structuring and storing data so that only the selected columns, such as the labels and measures used in aggregations, have to be read from storage. This requires less I/O and is one of the main reasons columnar formats offer great performance here (the others being sophisticated partitioning, parallel processing, compression, and metadata, as in Apache Parquet).

    So, regarding your question, I'd say you should only worry about precomputing aggregations in a columnar database if you experience low performance in ad hoc query scenarios and cannot solve it in more immediate ways (like caching, proper partitioning, and compression; see the partitioning sketch below). But this also depends on which database/SaaS/file format you use.
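    For illustration only (Hive/Spark-style DDL with assumed table and column names; the exact syntax depends on the engine), partitioning a fact table by date means queries that filter on a date range read far fewer files:

    -- Partition the fact table by date; a query filtering on sales_date
    -- only touches the matching partitions instead of the whole table.
    CREATE TABLE sales_fact (
      region        STRING,
      state         STRING,
      city          STRING,
      country       STRING,
      sales_amount  DOUBLE,
      discount_sale DOUBLE
    )
    PARTITIONED BY (sales_date DATE)
    STORED AS PARQUET;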

    As to dimensional modeling, that's a different issue. If you use a columnar file format like Parquet, it could actually be desirable (depending on the user and use case) to use something like Hive to create a (meta) dimensional model over the files, so that you can, for example, expose database tables and a SQL interface to your users instead of a bunch of files.
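    A minimal sketch of that idea in Hive DDL (the table layouts, keys, and file locations are assumptions for illustration):

    -- Expose existing Parquet files as fact and dimension tables ...
    CREATE EXTERNAL TABLE fact_sales (
      date_key     INT,
      store_key    INT,
      sales_amount DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/warehouse/sales/fact_sales/';

    CREATE EXTERNAL TABLE dim_store (
      store_key INT,
      city      STRING,
      region    STRING
    )
    STORED AS PARQUET
    LOCATION '/warehouse/sales/dim_store/';

    -- ... so users can run familiar star-schema queries over the files.
    SELECT d.region, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store d ON f.store_key = d.store_key
    GROUP BY d.region;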

    Regarding Power BI, as with most reporting tools, you can use it in DirectQuery mode if users will indeed be working with datasets over 10 GB.

    PS: in a columnar database, that specific piece of SQL will not "scan the complete table"; it will only scan the columns you select. That's part of the optimization of a columnar design.