As Hive keeps the data in distributed, Which query will be more efficient out of below two, when we have not consider that column in partition by or in bucketing.
select max(stat_id) from stats_tbl ;
select stat_id from stats_tbl order by stat_id desc limit 1;
Definitely select max(stat_id) from stats_tbl
because order by
requires gathering (read "lots of shuffle") all the data into a single reducer (and that's why you have to supply a limit
clause with it) which will be inefficient compared to an aggregate function that can be computed distributedly.