
How to structure/implement multidimensional data / data cube


I've been reading into what a data cube is, and there are lots of resources explaining what it is and why you would use one (OLAP, business intelligence, aggregations on specific columns), but never how.

Most of the resources seem to reference relational data stores, but it doesn't seem like you have to use an RDBMS.

But nothing seems to show how to structure the schema and how to query it efficiently to avoid the slow run time of aggregating over all of this data. The best I could find was this edX class that is "not currently available": Developing a Multidimensional Data Model.


Solution

  • You probably already know that there are two different OLAP approaches:

    • MOLAP requires a data load step that pre-processes the possible aggregations (defined in advance as 'cubes'). Internally, a MOLAP-based solution pre-calculates measures for the possible aggregations, and as a result it can execute OLAP queries very fast. The most important drawbacks of this approach come from the fact that MOLAP acts as a cache: you need to re-load the input data to refresh a cube (which can take a long time, say hours), and full reprocessing is needed if you decide to add new dimensions or measures to your cubes. There is also a natural limit on the dataset size plus cube configuration. (A minimal sketch of this load-then-lookup pattern follows this list.)
    • ROLAP doesn't try to pre-process the input data; instead, it translates an OLAP query into a database aggregate query and calculates values on the fly. The "R" means relational, but the approach can be used even with NoSQL databases that support aggregate queries (say, MongoDB). Since there is no data cache, users always get current data (in contrast with MOLAP), but the database should be able to execute aggregate queries rather fast. For relatively small datasets the usual OLTP databases can work fine (SQL Server, PostgreSQL, MySQL, etc.), but for large datasets specialized DB engines (like Amazon Redshift) are used; they support efficient distributed execution and can process many TB in seconds. (A sketch of this query translation also follows the list.)

    Nowadays it makes little sense to develop a MOLAP solution; this approach was relevant more than 10 years ago, when servers were limited to small amounts of RAM and a SQL database on HDD couldn't process GROUP BY queries fast enough, so MOLAP was the only way to get truly 'online' analytical processing. Currently we have very fast NVMe SSDs, and servers can have hundreds of gigabytes of RAM and tens of CPU cores, so for a relatively small database (up to a TB or a bit more) the usual OLTP databases can work as a ROLAP backend fast enough (executing queries in seconds); for really big data, MOLAP is almost unusable anyway, and a specialized distributed database should be used.
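
    To make the MOLAP idea concrete, here is a minimal sketch in Python (the fact rows, dimension names, and measure are all invented for illustration): the "load" step pre-aggregates a fact table over every combination of dimensions, so later queries are just dictionary lookups.

    from itertools import combinations

    # Hypothetical fact rows: (region, product, year, sales_amount)
    facts = [
        ("EU", "laptop", 2023, 1200.0),
        ("EU", "phone",  2023,  800.0),
        ("US", "laptop", 2024, 1500.0),
        ("US", "phone",  2023,  650.0),
    ]
    DIMENSIONS = ("region", "product", "year")  # measure: SUM(sales_amount)

    def load_cube(rows):
        """MOLAP-style load step: pre-compute SUM(sales_amount) for every
        subset of dimensions, i.e. every possible GROUP BY."""
        cube = {}
        for dims in (c for r in range(len(DIMENSIONS) + 1)
                       for c in combinations(range(len(DIMENSIONS)), r)):
            agg = cube.setdefault(dims, {})
            for row in rows:
                key = tuple(row[i] for i in dims)
                agg[key] = agg.get(key, 0.0) + row[3]
        return cube

    def query(cube, **filters):
        """Answer an aggregate query from the pre-computed cells only."""
        dims = tuple(sorted(DIMENSIONS.index(d) for d in filters))
        key = tuple(filters[DIMENSIONS[i]] for i in dims)
        return cube[dims].get(key, 0.0)

    cube = load_cube(facts)                         # slow, done once per refresh
    print(query(cube, region="EU"))                 # 2000.0, instant lookup
    print(query(cube, product="phone", year=2023))  # 1450.0

    Note that the load step enumerates 2^N group-bys; that is exactly why refreshing the data or adding a dimension means full reprocessing, and why dataset size is naturally limited.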
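    A ROLAP engine, by contrast, stores nothing pre-computed; it just rewrites a cube-style request into an aggregate query against the backing database. Here is a minimal sketch of that translation, using Python's built-in sqlite3 as a stand-in backend (the schema and column names are again invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE sales_fact (
        region TEXT, product TEXT, year INTEGER, sales_amount REAL)""")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
        ("EU", "laptop", 2023, 1200.0),
        ("EU", "phone",  2023,  800.0),
        ("US", "laptop", 2024, 1500.0),
        ("US", "phone",  2023,  650.0),
    ])

    ALLOWED = {"region", "product", "year"}  # whitelist column names

    def rolap_query(group_by, filters=None):
        """ROLAP-style translation: dimensions -> GROUP BY columns,
        slice conditions -> WHERE clause, measure -> aggregate function.
        The database does the work at query time; no cube is stored."""
        assert set(group_by) <= ALLOWED and set(filters or {}) <= ALLOWED
        sql = f"SELECT {', '.join(group_by)}, SUM(sales_amount) FROM sales_fact"
        params = []
        if filters:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
            params = list(filters.values())
        sql += " GROUP BY " + ", ".join(group_by)
        return con.execute(sql, params).fetchall()

    print(rolap_query(["region"]))                   # e.g. [('EU', 2000.0), ('US', 2150.0)]
    print(rolap_query(["product"], {"year": 2023}))  # e.g. [('laptop', 1200.0), ('phone', 1450.0)]

    Results are always current because each request hits the live table; speed now depends entirely on how fast the backend executes GROUP BY, which is exactly the trade-off described above.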