I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C\
GROUP BY A
I have 2 problems with this solution
Is there a way to achieve what I am trying to do without writing your custom UDF ?
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C,
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.