Search code examples
sqlhadoophivehiveqlrank

Using RANK in HiveQL, dynamic limits


Trying to perform a dynamic limit in hive sql via the rank function.

PROBLEM: I want to use the limit from table A against table B to create the output. Example below.

TABLE A:

 ID   | Limit
 ------------
 123  | 1
 456  | 3
 789  | 2

TABLE B:

ID   | User
-------
123  | ABC
123  | DEF
123  | GHI
456  | JKL
456  | MNO
789  | PQR
789  | RST

OUTPUT:

ID   | User
----------
123  | ABC
456  | JKL
456  | MNO
789  | PQR
789  | RST

Unfortunately you cannot do a dynamic limit (as far as I know) in hive sql. So I was trying to use rank. My current query looks like this:

SELECT c.id, c.users, c.rnk
FROM (
    SELECT b.id, b.user, a.limit, rank() over (ORDER BY b.id DESC) as rnk
    FROM a JOIN b
    ON a.id = b.id
    ) c
WHERE rnk < c.limit;

Currently I get the error:

ParseException line 3:9 cannot recognize input near 'rank' '(' ')' in from source 0

Any ideas why? Or maybe a better approach?

Thanks!


Solution

  • SELECT c.id, c.users, c.rnk
    FROM (
        SELECT b.id, b.user, a.limit, row_number() over (PARTITION by b.id ORDER BY b.id ) as rn
        FROM a JOIN b
        ON a.id = b.id
        ) c
    WHERE rn <= c.limit;
    

    In the above query row_number() will number rows after join, filter in the where clause will work as limit. ORDER BY is not necessary for simply limiting rows without any preference, replace ORDER BY with your rule, for example order by user.