Search code examples
sqlimpala

Incremental integer ID in Impala


I am using Impala for querying parquet-tables and cannot find a solution to increment an integer-column ranging from 1..n. The column is supposed to be used as ID-reference. Currently I am aware of the uuid() function, which

Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.

Anyhow, this is not suitable for me since I have to pass the ID to another system which requests an ID in style of 1..n. I also already know that Impala has no auto-increment-implementation.

The desired result should look like:

-- UUID() provided as example - I want to achieve the `my_id`-column.

| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1     | 50d53ca4-b...| "a"          |
| 2     | 6ba8dd54-1...| "b"          |
| 3     | 515362df-f...| "c"          |
| 4     | a52db5e9-e...| "d"          |
|-------|--------------|--------------|

How can I achieve the desired result (integer-ID ranging from 1..n)?


Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.


Solution

  • Since other Q&A's like this one only came up with uuid()-alike answers, I put some thought in it and finally came up with this solution:

    SELECT
        row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
        , some_content
    FROM some_table
    
    • row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
    • PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
    • ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.

    The command creates the desired incremental ID without any nested SQL-statements or other tricks.

    | my_id | some_content |
    |-------|--------------|
    | 1     | "a"          |
    | 2     | "b"          |
    | 3     | "c"          |
    | 4     | "d"          |
    |-------|--------------|