Search code examples
hadoophivehqlhiveqlcloudera

Hive: How to output total row count as a variable


I have a dataset that I'm de-duping with the following code:

select session_id, sol_id, id, session_context_code, date
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

I want to add a variable that stores the total count of rows after dedup, and I tried with count(*):

select session_id, sol_id, id, session_context_code, date,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

The error I received:

ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'

I just want to output a count as a variable that counts all distinct records by session_id and sol_id after de-duped by row number. How do I incorporate that to the code?

Based on Gomz's suggestion, but received error:

ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:614 missing EOF at 'group' near 'nifi_date'

Code:

select session_id, solicit_id, nifi_date,id, session_context_code,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         )undup
        where undup.rn =1 and 
        session_context_code in ("4","3") and
        order by session_id, sol_id, nifi_date
        group by session_id, sol_id, nifi_date,id, session_context_code

Solution

  • A Hive query with COUNT(*) along with columns in SELECT clause should have these columns grouped at the end with GROUP BY.

    Some samples:

    SELECT COUNT(*) FROM employees;

    SELECT id, name, COUNT(*) FROM employees GROUP BY id, name;

    In your issue scenario, the query should look like below,

    select session_id, sol_id, id, session_context_code, count(*) as total
        from (
            select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
            substr(case_id,2,9) as id
    
            from df.t1_data
             )undup
            where undup.rn =1 
    GROUP BY session_id, sol_id, id, session_context_code
            order by session_id, sol_id, date
    

    You can read more HERE

    Update: If you want to count all distinct records only by session_id and sol_id, then the query can be as follows,

    select session_id, sol_id, count(*) as total
        from (
            select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
            substr(case_id,2,9) as id
    
            from df.t1_data
             )undup
            where undup.rn =1 
    GROUP BY session_id, sol_id
            order by session_id, sol_id, date;
    

    As discussed, you can use only the columns you need to be counted in SELECT and GROUP BY.

    If you need the results with multiple columns more than what needs to be counted, you can create a temporary table with only the columns those are counted and join with the original table. i.e., if you need the columns c,d,e,f as well from the table even though you need the count of columns a, b then you can do something like below,

    CREATE TABLE tmp AS 
    SELECT a, b, count(*)
    FROM table1
    GROUP BY a,b;
    

    Do a JOIN between tmp and table1 on columns a, b

    SELECT y.a, y.b, x.c, x.d, x.e, x.f
    FROM tmp y, table1 x
    WHERE y.a=x.a
    AND y.b=x.b;
    

    Hope this helps!