Search code examples
hivequery-optimization

Hive query tuning want to get rid of union all


I want to tune a query that unions all based on each rank.

= current query

select
        '111.111.111.111' agent_ip,
        tbl.source_ip,
        sum(tbl.cnt) t_count
from
        event tbl
where

tbl.agent_ip    = '111.111.111.111'

group by
        tbl.source_ip
order by
        t_count desc limit 3  
union all
select
        '222.222.222.222' agent_ip,
        tbl.source_ip,
        sum(tbl.cnt) t_count
from
        event tbl
where

tbl.agent_ip    = '222.222.222.222'

group by
        tbl.source_ip
order by
        t_count desc limit 3  
                


[result] 
111.111.111.111 39.126.169.47   1
111.111.111.111 45.67.97.19 1
222.222.222.222 45.134.144.113  222
222.222.222.222 94.102.61.31    111
222.222.222.222 146.88.240.4    4   

When I use the query above, the question and answer are too long. Union all seems to be a problem, is there any way to reduce it?

I hope you don't do each query separately...

tbl.agent_ip in ('111.111.111.111', '222.222.222.222')

I guess there must be a way to do this


Solution

  • You can give below SQL a try -

    select agent_ip,source_ip,t_count 
    from (
    select subqry1.*
    row_number() over( partition by agent_ip order by agent_ip, t_count) as rn -- create a rownumber for each row
    from (
    select  -- select base data
            agent_ip,
            source_ip,
            sum(tbl.cnt) t_count
    from event tbl
    WHERE tbl.agent_ip in ('111.111.111.111', '222.222.222.222')
    group by 1,2) subqry1
    ) subqry2
    Where rn<=3  -- to select top 3 rows for agent_ip
    

    Explanation is in the SQL. I captured both queries in one. Should be faster than union all but i dont know your data volume or partition scheme so not 100% sure.

    It has 2 stages -

    1. compute base data using subquery 1.
    2. Give them rank for each agent IP.
    3. Then pick top 3 count.