Question:
Is it possible to create an MD5 digest between IICS and Snowflake that both match and can be productionized, as seen in the column outpus for MD5_ZERO and MD5_FIVE in the code snippet below?
Details:
I am working to determine record updates by hashing column inputs on NUMERICS. The datawarehouse is Snowflake and the ETL tool is Informatica's IICS Data Integration (i.e., mapping).
Basically, IICS is creating an MD5 digest that only matches Snowflake when the input to MD5 is the numeric string representation that I type in manually (i.e., hard-coded). But, this does not work for something that I am trying to productionize where the function hashes values from the table rather than what I hard-code.
I have tried several variations and permutations in both Snowflake and IICS, but for the life of me, I cannot get the two to create a MD5 digest that matches. I checked the documentation for both Snowflake and IICs (see links below) without success.
Example:
For example, when hashing 5.000000
defined as NUMERIC(38,6), I get the following digests in a productionized setting:
E4DA3B7FBBCE2345D7772B0674A318D5
E4DA3B7FBBCE2345D7772B0674A318D5
- MD5_FIVE <-- I need this463DAAEAA59E931EC54B0D6375A05A0F
- MD5_BAR <-- But I get thisWhat is happening here is that the MD5 in Snowflake is different when hashing a defined NUMERIC(38,6) value (i.e., 5.000000) that is stored on the table itself than when just manually typing it into the MD5() function itself. So, something like the query below yields a different MD5, which is unexpected - they should be the same.
SELECT MD5(TO_CHAR(FOO)) AS MD5_FOO
,MD5(5.000000) AS MD5_FIVE
FROM TMP
I need Snowflake to generate an MD5 digest that is equal to the output of MD5(5.000000), but as read from the table and not hard-coded into the function itself.
Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?
Troubleshooting example here:
CREATE TABLE TMP (
FOO NUMBER(38,6),
BAR NUMBER(38,6)
);
INSERT INTO TMP VALUES(0, 5)
;
-- Note: When copying the field value from the table, the INSERTED values now have trailing zeroes.
SELECT *
,MD5(0.000000) AS MD5_ZERO
,MD5(5.000000) AS MD5_FIVE
,MD5(TO_VARCHAR(FOO::VARCHAR(38))) AS MD5_FOO
,MD5(TO_VARCHAR(BAR::VARCHAR(38))) AS MD5_BAR
,MD5(TO_CHAR(BAR, '">"$99.0"<"')) AS D2_1
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_2
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_3
,MD5(TO_CHAR(BAR, '">"B9,999.0"<"')) AS D4_1
,MD5(TO_CHAR(BAR, '">"TM"<"')) AS TM
,MD5(TO_CHAR(BAR, '">"TME"<"')) AS TME
,MD5(TO_CHAR(BAR, '">"TM9"<"')) AS TM9
,MD5(TO_CHAR(BAR, '">"0XXX"<"')) AS X4
,MD5(TO_CHAR(BAR, '">"S0XXX"<"')) AS SX4
FROM TMP
;
https://docs.snowflake.com/en/sql-reference/functions/to_char#examples-that-convert-numbers
The issue is with inconsistent inputs to the hashing function.
Using your troubleshooting example as a starting point, here's a table with two columns, one of type NUMBER
with default precision and scale ((38,0)
) and the other with type NUMBER
with explicitly defined precision and scale ((38,6)
). Two rows are added, adding 0, 5 to each column.
CREATE OR REPLACE TEMPORARY TABLE tmp (
num_default NUMBER,
num_custom NUMBER(38,6)
);
INSERT INTO tmp VALUES(0, 0);
INSERT INTO tmp VALUES(5, 5);
Here is the result of calculating the MD5 digest on each value, with the input shown alongside each result. Two things to note:
MD5()
functionSELECT
num_default,
MD5(num_default) AS md5_defualt,
num_custom,
MD5(num_custom) AS md5_custom,
FROM tmp;
NUM_DEFAULT | MD5_DEFUALT | NUM_CUSTOM | MD5_CUSTOM |
---|---|---|---|
0 | cfcd208495d565ef66e7dff9f98764da | 0.000000 | 2d0d659d374801a71f65622406b26458 |
5 | e4da3b7fbbce2345d7772b0674a318d5 | 5.000000 | 463daaeaa59e931ec54b0d6375a05a0f |
Here's a query that explicitly converts the values using TO_CHAR()
and specifies the output format. FM
is used to toggle the fill mode to "compact" and TM9
is used to output numbers in "text minimal" format. The former causes the latter to omit trailing zeroes.
See https://docs.snowflake.com/en/sql-reference/sql-format-models#text-minimal-numeric-formats and https://docs.snowflake.com/en/sql-reference/sql-format-models#format-modifiers-and-generic-space-handling (I don't know why the "Note" at the end says that fill mode has no effect on text minimal format elements, that's simply not true and contradicts other points in the documentation as well as the actual system behavior).
SELECT
TO_CHAR(num_default, 'FMTM9') AS default_fmtm9,
MD5(TO_CHAR(num_default, 'FMTM9')) AS md5_default_fmtm9,
TO_CHAR(num_custom, 'FMTM9') AS custom_fmtm9,
MD5(TO_CHAR(num_custom, 'FMTM9')) AS md5_custom_fmtm9,
FROM tmp;
DEFAULT_FMTM9 | MD5_DEFAULT_FMTM9 | CUSTOM_FMTM9 | MD5_CUSTOM_FMTM9 |
---|---|---|---|
0 | cfcd208495d565ef66e7dff9f98764da | 0 | cfcd208495d565ef66e7dff9f98764da |
5 | e4da3b7fbbce2345d7772b0674a318d5 | 5 | e4da3b7fbbce2345d7772b0674a318d5 |
Here are some additional queries to demonstrate how numeric literals are processed by default (i.e. as NUMBER(38,0)
) and what happens when you explicitly cast a literal to NUMBER(38,6)
).
SELECT
5,
MD5(5) AS md5_five_literal;
5 | MD5_FIVE_LITERAL |
---|---|
5 | e4da3b7fbbce2345d7772b0674a318d5 |
SELECT
5.000000,
MD5(5.000000) AS md5_five_literal;
5.000000 | MD5_FIVE_LITERAL |
---|---|
5 | e4da3b7fbbce2345d7772b0674a318d5 |
SELECT
5.000000::NUMBER(38,6),
MD5(5.000000::NUMBER(38,6)) AS md5_five_literal_casted;
5.000000::NUMBER(38,6) | MD5_FIVE_LITERAL_CASTED |
---|---|
5.000000 | 463daaeaa59e931ec54b0d6375a05a0f |
The main takeaway from all this is that you need to focus on what exactly is being input into the hashing function. You should strive to remove all ambiguity by performing explicit conversions to the string expression that the Snowflake MD5()
function accepts.
For reference, see https://docs.snowflake.com/sql-reference/data-type-conversion and https://docs.snowflake.com/en/sql-reference/functions/md5.
Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?
There's nothing wrong with the MD5 implementation. This same behavior can be illustrated in a different environment. For example, using Go:
package main
import (
"crypto/md5"
"fmt"
)
func printMD5(input string) {
inputBytes := []byte(input)
fmt.Printf("input = %q\n", input)
fmt.Printf("inputBytes = %x\n", inputBytes)
fmt.Printf("md5(inputBytes) = %x\n\n", md5.Sum(inputBytes))
}
func main() {
printMD5("5")
printMD5("5.000000")
}
Output:
input = "5"
inputBytes = 35
md5(inputBytes) = e4da3b7fbbce2345d7772b0674a318d5
input = "5.000000"
inputBytes = 352e303030303030
md5(inputBytes) = 463daaeaa59e931ec54b0d6375a05a0f