I'm working on IBM Netezza/PureData and I want to add an identifier column to a table containing billions of new records every day, so that I can track each record as it travels through different tables and systems. I want this identifier to have more than 64 bits of entropy, as the column may contain (in time) more than 10^12 different records, and I want to avoid hash collisions. According to this table, 64 bits is not enough to avoid hash collisions with this number of records.
Not enough bits
So, on Netezza I can easily do
select hash8(123456) as id
which will return a 64-bit number (BIGINT):
-1789169473613552245
This is perfectly readable, but it has only 64-bits of entropy.
Not human readable
I can also do:
select hash(123456) as id
to create a 128-bit hash on Netezza. This has more than enough entropy, but it becomes an unreadable mess of unicode characters:
oð8^GþåíOpJ
I'm afraid this will cause trouble when I start combining this date with tables from other systems.
Enough bits and human readable?
So instead, I would like to create a human-readable identifier by, for example, converting this 128-bit unicode string to a base-62 string, containing only alphanumeric characters (0-9, a-z, A-Z). Something like:
6KMPOATg6Y5TbuEZlD59Dp
Any ideas on how to do this? Ideally with only (Netezza) SQL-code or functions...
Since you seem to have the SQL extensions available (hash() being one of the Netezza SQL extension functions), you could try doing 'rawtohex()' on the output of your hash()
e.g.
select rawtohex(hash(123456));
This gives a nice HEX string representation of the hashed data:
'E10ADC3949BA59ABBE56E057F20F883E'