Search code examples
sqlamazon-athenaprestotrino

Calculating hash integer from a string in Athena


I'm trying to calculate a hash from a string for best-effort ordering and partioning purposes in Athena. There is no String to hashCode() similar in Athena, so as a best effort, I try to get the 2nd character and calculate its codepoint and get the modulus. (As I said, best effort, maybe a nice effort)

Consider the query:

SELECT
    doc_id,
    substring(doc_id, 2, 1),
    typeof(split(substring(doc_id, 2, 1)))
FROM events LIMIT 100

The 3rd row returns a varchar but the codepoint function expects a varchar(1) and casting it does not work as cast(substring(doc_id, 2, 1) as varchar(1)).

FUNCTION_NOT_FOUND: line 6:5: Unexpected parameters (varchar) for function codepoint. Expected: codepoint(varchar(1))

How can I accomplish this task without modifiying the data source? I'm open to ideas.


Solution

  • You can compute a hash code with the xxhash64 function. It takes a varbinary as input, so first cast the string to that type. Since the function also returns a 64-bit varbinary value, you can convert it to a bigint via the from_big_endian_64 function

    WITH t(x) AS (VALUES 'hello')
    SELECT from_big_endian_64(xxhash64(cast(x AS varbinary)))
    FROM t
    

    output:

            _col0
    ---------------------
     2794345569481354659
    (1 row)