Tags: apache-spark, hive, rollup, grouping-sets

Can we do rollup or cube without exploding rows in Hive/Spark?


I'm aggregating over 4 dimensions of a Hive table (tens of billions of rows), and the results need to be rolled up or cubed. Say the table records user interactions, and I aggregate on which mobile application the user used, which mobile OS they were on, and so on. The SQL looks like this:

select d1, d2, d3, d4, count(distinct userId) as uv, sum(1) as pv from T group by cube(d1, d2, d3, d4)

The current implementation of grouping sets / rollup / cube explodes the input (according to this jira): a cube over 4 dimensions produces 2^4 = 16 grouping sets, so in effect the input is scanned 16 times, which is quite resource-consuming.
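Concretely, cube(d1, d2, d3, d4) is equivalent to listing all 16 grouping sets explicitly, and the engine emits one copy of every input row per grouping set (in Spark this is the Expand operator). Written out with the same table and columns as above:

select d1, d2, d3, d4, count(distinct userId) as uv, sum(1) as pv
from T
group by d1, d2, d3, d4
grouping sets (
  (d1, d2, d3, d4),
  (d1, d2, d3), (d1, d2, d4), (d1, d3, d4), (d2, d3, d4),
  (d1, d2), (d1, d3), (d1, d4), (d2, d3), (d2, d4), (d3, d4),
  (d1), (d2), (d3), (d4),
  ()
)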

My workaround is a temporary table that first aggregates on userId and the 4 dimensions; I then do the rollup on that:

create table tmp as select userId, d1, d2, d3, d4, sum(1) as pv from T group by userId, d1, d2, d3, d4

then

select d1, d2, d3, d4, count(distinct userId) as uv, sum(pv) as pv from tmp group by cube(d1, d2, d3, d4)
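(Note that count(distinct userId) is still needed in the second step: at rolled-up levels a user can appear under several dimension combinations, so a plain count(1) would overcount uv.)

For reference, the two steps can also be written as a single statement with a CTE. This is a sketch of the same logic; whether the intermediate result is reused as effectively as a physical temp table depends on the engine and version:

with tmp as (
  select userId, d1, d2, d3, d4, sum(1) as pv
  from T
  group by userId, d1, d2, d3, d4
)
select d1, d2, d3, d4, count(distinct userId) as uv, sum(pv) as pv
from tmp
group by cube(d1, d2, d3, d4)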

Since the input data is huge, this pre-aggregation reduces the runtime significantly, because it shrinks the input that each of the 16 passes has to scan.
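One way to verify the effect is to compare the two plans with EXPLAIN (supported by both Hive and Spark SQL; in Spark the cube shows up as an Expand node, though the exact plan output varies by version):

-- plan of the direct cube: Expand reads T itself
explain select d1, d2, d3, d4, count(distinct userId) as uv, sum(1) as pv from T group by cube(d1, d2, d3, d4);

-- plan of the two-step version: Expand reads the much smaller tmp
explain select d1, d2, d3, d4, count(distinct userId) as uv, sum(pv) as pv from tmp group by cube(d1, d2, d3, d4);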

So I wonder: could Hive/Spark have this rule built in, or do people normally deal with this problem in some other way that I'm not aware of?


Solution

  • No, Spark SQL's Catalyst optimizer doesn't have any rule for pre-aggregate computation. Neither HCatalog nor the Spark SQL catalog stores hierarchy information, which is why this functionality doesn't exist as of now.

    OLAP tools usually provide this functionality: they store the hierarchy details and use them to compute the aggregates in the OLAP cube.

    If you're looking for OLAP functionality, then you should probably explore Snowflake or Kyvos Insights.