Search code examples
hadoophivehiveqlhadoop2

Hive Getting error on group by column while using case statements and aggregations


I am working on a query in hive. In that I am using aggregations like sum and case statements and group by clause. I have changed the column names and table names but my logic is same which I was using in my project

select 
empname,
empsal, 
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when tot_sal > 1000 then exp(tot_hike)
else 0
end as manager
from employee
group by 
empname,
empsal,
emphike

For the above query I was getting error as "Expression not in group by key '1000'". So I have slightly modified the query and tried again My other query is

select 
empname,
empsal, 
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when sum(empsal) > 1000 then exp(sum(emphike))
else 0
end as manager
from employee
group by 
empname,
empsal,
emphike

For above query its putting me error as "Expression not in group by key 'Manager'". When I add manager in the group by its showing invalid alias. Please help me out here


Solution

  • I see three issues in your query:

    1.) Hive cannot group by a variable you defined in the select block by the name you gave it right away. You will probably need a subquery for that.

    2.) Hive tends to show errors when sum or count operations are not at the end of the query.

    3.) Although I do not know what your goal is, I think that your query will not deliver the desired result. If you group by empsal there would be no difference between empsal and sum(empsal) by design. Same goes for emphike and sum(emphike).

    I think the following query might solve these issues:

    select
    a.empname,
    a.tot_sal, 
    a.tot_hike,
    if(a.tot_sal > 1000, exp(a.tot_hike), 0) as manager
    from
    (select 
    empname,
    sum(empsal) as tot_sal,
    sum(emphike) as tot_hike,
    from employee
    group by 
    empname
    )a
    

    The if statement is equivalent to your case statement, however I find it a bit easier to read.

    In this example you wouldn't need to group by after the subquery because the grouping is done in the subquery a.