Search code examples
scalaapache-sparkhiveqlsqlperformance

Which is more efficient, max or order by desc limit 1 in HIVE using spark version 2


As Hive keeps the data in distributed, Which query will be more efficient out of below two, when we have not consider that column in partition by or in bucketing.

  1. select max(stat_id) from stats_tbl ;
  2. select stat_id from stats_tbl order by stat_id desc limit 1;

Solution

  • Definitely select max(stat_id) from stats_tbl because order by requires gathering (read "lots of shuffle") all the data into a single reducer (and that's why you have to supply a limit clause with it) which will be inefficient compared to an aggregate function that can be computed distributedly.