sql hive cloudera

Hive [Error 10025]: Expression not in GROUP BY key name


I'm trying to select records within a group based on key terms found in a variable.

name is a string that may contain the key terms of interest.

Groups are defined by the combination of id1 and id2.

I want to extract all records from the groups that contain at least one of the key terms.

select id1, id2, name,
   case
    when name LIKE '%LOAD_TIME' then 1
    when name LIKE '%LOGIN_SESSION_TIME' then 1
   end as b_flag
   from df1
   group by id1, id2
   having (sum(b_flag) > 0 )

df1:

id1  id2  name                               
1     1    xxxLOAD_TIME
1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx
1     1    xxxxSome other timexxxx
2     2    xxSome other timex
3     1    xxxLOAD_TIME
3     1    xxSome other timexx

After creating b_flag the new dataset should look like:

id1  id2  name                             b_flag   
1     1    xxxLOAD_TIME                      1
1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx    1
1     1    xxxxSome other timexxxx   
2     2    xxSome other timex
3     1    xxxLOAD_TIME                      1
3     1    xxSome other timexx

Desired output:

   id1  id2  name                             b_flag   
    1     1    xxxLOAD_TIME                      1
    1     1    xxxLOGIN_LOGIN_SESSION_TIMExxx    1
    1     1    xxxxSome other timexxxx   
    3     1    xxxLOAD_TIME                      1
    3     1    xxSome other timexx

I do not see what is wrong with my code, but I have been getting the same error:

[Error 10025]: Expression not in GROUP BY key name

Thank you for any help


Solution

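  • The error is raised because name appears in the select list but is neither a GROUP BY key nor wrapped in an aggregate function. Since you want to keep every row of the matching groups rather than one row per group, you could do this with window functions instead: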

    select id1, id2, name, b_flag
    from (
        select 
            t.*, 
            case when name LIKE '%LOAD_TIME' or name LIKE '%LOGIN_SESSION_TIME' then 1 end b_flag,
            sum(case when name LIKE '%LOAD_TIME' or name LIKE '%LOGIN_SESSION_TIME' then 1 end) 
                over(partition by id1, id2) sum_b_flag
        from mytable t
    ) t
    where sum_b_flag > 0
    

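    The inner query flags each row that matches the criteria and computes a window sum of that flag over all rows sharing the same (id1, id2). The outer query then keeps only the rows of groups whose sum is greater than 0. Here mytable stands for df1 from the question.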
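
    One thing worth checking, judging from the sample data: '%LOGIN_SESSION_TIME' only matches names that end with that term, so a value such as 'xxxLOGIN_LOGIN_SESSION_TIMExxx' would not be flagged. Also, _ in a LIKE pattern is a single-character wildcard rather than a literal underscore. If the key terms can appear anywhere in name (an assumption on my part), a rough way to compare the options is the sketch below. The demo CTE and the column aliases are made up for illustration:

    ```sql
    -- hypothetical inline rows, copied from the question's sample data
    with demo as (
        select 'xxxLOAD_TIME' as name
        union all
        select 'xxxLOGIN_LOGIN_SESSION_TIMExxx' as name
        union all
        select 'xxxxSome other timexxxx' as name
    )
    select
        name,
        name LIKE '%LOGIN_SESSION_TIME'       as ends_with_term,   -- term must be at the very end
        name LIKE '%LOGIN_SESSION_TIME%'      as contains_term,    -- term anywhere; _ is still a wildcard
        instr(name, 'LOGIN_SESSION_TIME') > 0 as contains_literal  -- literal substring test
    from demo
    ```

    If contains_term gives the rows you expect, adding the trailing % to the patterns in the query above should be enough.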

    If you don't want to repeat the expression that computes the flag, you can use an additional subquery:

    select id1, id2, name, b_flag
    from (
        select t.*, sum(b_flag) over(partition by id1, id2) sum_b_flag
        from (
            select 
                t.*, 
                case when name LIKE '%LOAD_TIME' or name LIKE '%LOGIN_SESSION_TIME' then 1 end b_flag
            from mytable t
        ) t
    ) t
    where sum_b_flag > 0
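
  • If you would rather keep the GROUP BY approach from the question, another option is to find the matching groups in a subquery and join back to the detail rows. This is only a sketch: it assumes the table is df1 as in the question, that id1 and id2 are never null (null keys would not join), and the aliases d and g are made up for illustration:

    ```sql
    select
        d.id1,
        d.id2,
        d.name,
        case when d.name LIKE '%LOAD_TIME' or d.name LIKE '%LOGIN_SESSION_TIME' then 1 end as b_flag
    from df1 d
    join (
        -- one row per (id1, id2) group that has at least one matching name;
        -- only the group keys are selected, so nothing falls outside the GROUP BY
        select id1, id2
        from df1
        group by id1, id2
        having sum(case when name LIKE '%LOAD_TIME' or name LIKE '%LOGIN_SESSION_TIME' then 1 else 0 end) > 0
    ) g
        on d.id1 = g.id1 and d.id2 = g.id2
    ```

    The aggregate is written out in full inside HAVING because the b_flag alias is not available at the level where it is defined. This version reads the table twice, so the window function query is probably the better fit when the table is large.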