Search code examples
sqlhivequery-optimizationdistincthiveql

Hive multiple distinct on query running slow?


Why is the below query taking forever to complete whereas if queried separately returns within seconds?

select count(*),count(distinct a), count(distinct b) from test.tablename;

However, if queried either of the below two queries it returns in seconds. e.g.

select count(*),count(distinct a) from test.tablename; 

or

select count(*),count(distinct b) from test.tablename;

anything to speed up? Already ran a analyze for compute stats. hive configurations are mostly optimized.

UPDATE: apologies. missed the count in the question.


Solution

  • Alternative approach if you do not have too big counts (too big arrays will cause OOM). size(collect_set()) will give you the distinct count.

    select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;