Search code examples
cryptographysnowflake-cloud-data-platformmd5informaticainformatica-cloud

Why do Snowflake and Informatica (IICS) MD5 function yield a different digest for the same input NUMERIC?


Question:

Is it possible to create an MD5 digest between IICS and Snowflake that both match and can be productionized, as seen in the column outpus for MD5_ZERO and MD5_FIVE in the code snippet below?

Details:

I am working to determine record updates by hashing column inputs on NUMERICS. The datawarehouse is Snowflake and the ETL tool is Informatica's IICS Data Integration (i.e., mapping).

Basically, IICS is creating an MD5 digest that only matches Snowflake when the input to MD5 is the numeric string representation that I type in manually (i.e., hard-coded). But, this does not work for something that I am trying to productionize where the function hashes values from the table rather than what I hard-code.

I have tried several variations and permutations in both Snowflake and IICS, but for the life of me, I cannot get the two to create a MD5 digest that matches. I checked the documentation for both Snowflake and IICs (see links below) without success.

Example:

For example, when hashing 5.000000 defined as NUMERIC(38,6), I get the following digests in a productionized setting:

  • IICS: E4DA3B7FBBCE2345D7772B0674A318D5
  • Snowflake: E4DA3B7FBBCE2345D7772B0674A318D5 - MD5_FIVE <-- I need this
  • Snowflake: 463DAAEAA59E931EC54B0D6375A05A0F - MD5_BAR <-- But I get this

What is happening here is that the MD5 in Snowflake is different when hashing a defined NUMERIC(38,6) value (i.e., 5.000000) that is stored on the table itself than when just manually typing it into the MD5() function itself. So, something like the query below yields a different MD5, which is unexpected - they should be the same.

SELECT MD5(TO_CHAR(FOO)) AS MD5_FOO
    ,MD5(5.000000) AS MD5_FIVE
FROM TMP

I need Snowflake to generate an MD5 digest that is equal to the output of MD5(5.000000), but as read from the table and not hard-coded into the function itself.

Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?

Troubleshooting example here:

CREATE TABLE TMP (
    FOO NUMBER(38,6),
    BAR NUMBER(38,6)
);
INSERT INTO TMP VALUES(0, 5)
;
-- Note: When copying the field value from the table, the INSERTED values now have trailing zeroes.
SELECT *
    ,MD5(0.000000) AS MD5_ZERO
    ,MD5(5.000000) AS MD5_FIVE
    ,MD5(TO_VARCHAR(FOO::VARCHAR(38))) AS MD5_FOO
    ,MD5(TO_VARCHAR(BAR::VARCHAR(38))) AS MD5_BAR
    ,MD5(TO_CHAR(BAR, '">"$99.0"<"')) AS D2_1
    ,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_2
    ,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_3
    ,MD5(TO_CHAR(BAR, '">"B9,999.0"<"')) AS D4_1
    ,MD5(TO_CHAR(BAR, '">"TM"<"')) AS TM
    ,MD5(TO_CHAR(BAR, '">"TME"<"')) AS TME
    ,MD5(TO_CHAR(BAR, '">"TM9"<"')) AS TM9
    ,MD5(TO_CHAR(BAR, '">"0XXX"<"')) AS X4
    ,MD5(TO_CHAR(BAR, '">"S0XXX"<"')) AS SX4
FROM TMP
;

https://docs.snowflake.com/en/sql-reference/functions/to_char#examples-that-convert-numbers

https://docs.informatica.com/data-integration/powercenter/10-5/transformation-language-reference/functions/to_char--numbers-.html


Solution

  • The issue is with inconsistent inputs to the hashing function.

    Using your troubleshooting example as a starting point, here's a table with two columns, one of type NUMBER with default precision and scale ((38,0)) and the other with type NUMBER with explicitly defined precision and scale ((38,6)). Two rows are added, adding 0, 5 to each column.

    CREATE OR REPLACE TEMPORARY TABLE tmp (
        num_default NUMBER,
        num_custom NUMBER(38,6)
    );
    INSERT INTO tmp VALUES(0, 0);
    INSERT INTO tmp VALUES(5, 5);
    

    Here is the result of calculating the MD5 digest on each value, with the input shown alongside each result. Two things to note:

    • the values are being silently coerced into a string expression, which is the type of input accepted by the MD5() function
    • anytime the input is different in any way, the result is going to be different
    SELECT 
        num_default,
        MD5(num_default) AS md5_defualt,
        num_custom,
        MD5(num_custom) AS md5_custom,
    FROM tmp;
    
    NUM_DEFAULT MD5_DEFUALT NUM_CUSTOM MD5_CUSTOM
    0 cfcd208495d565ef66e7dff9f98764da 0.000000 2d0d659d374801a71f65622406b26458
    5 e4da3b7fbbce2345d7772b0674a318d5 5.000000 463daaeaa59e931ec54b0d6375a05a0f

    Here's a query that explicitly converts the values using TO_CHAR() and specifies the output format. FM is used to toggle the fill mode to "compact" and TM9 is used to output numbers in "text minimal" format. The former causes the latter to omit trailing zeroes.

    See https://docs.snowflake.com/en/sql-reference/sql-format-models#text-minimal-numeric-formats and https://docs.snowflake.com/en/sql-reference/sql-format-models#format-modifiers-and-generic-space-handling (I don't know why the "Note" at the end says that fill mode has no effect on text minimal format elements, that's simply not true and contradicts other points in the documentation as well as the actual system behavior).

    SELECT
        TO_CHAR(num_default, 'FMTM9') AS default_fmtm9,
        MD5(TO_CHAR(num_default, 'FMTM9')) AS md5_default_fmtm9,
        TO_CHAR(num_custom, 'FMTM9') AS custom_fmtm9,
        MD5(TO_CHAR(num_custom, 'FMTM9')) AS md5_custom_fmtm9,
    FROM tmp;
    
    DEFAULT_FMTM9 MD5_DEFAULT_FMTM9 CUSTOM_FMTM9 MD5_CUSTOM_FMTM9
    0 cfcd208495d565ef66e7dff9f98764da 0 cfcd208495d565ef66e7dff9f98764da
    5 e4da3b7fbbce2345d7772b0674a318d5 5 e4da3b7fbbce2345d7772b0674a318d5

    Here are some additional queries to demonstrate how numeric literals are processed by default (i.e. as NUMBER(38,0)) and what happens when you explicitly cast a literal to NUMBER(38,6)).

    SELECT
        5,
        MD5(5) AS md5_five_literal;
    
    5 MD5_FIVE_LITERAL
    5 e4da3b7fbbce2345d7772b0674a318d5
    SELECT
        5.000000,
        MD5(5.000000) AS md5_five_literal;
    
    5.000000 MD5_FIVE_LITERAL
    5 e4da3b7fbbce2345d7772b0674a318d5
    SELECT
        5.000000::NUMBER(38,6),
        MD5(5.000000::NUMBER(38,6)) AS md5_five_literal_casted;
    
    5.000000::NUMBER(38,6) MD5_FIVE_LITERAL_CASTED
    5.000000 463daaeaa59e931ec54b0d6375a05a0f

    The main takeaway from all this is that you need to focus on what exactly is being input into the hashing function. You should strive to remove all ambiguity by performing explicit conversions to the string expression that the Snowflake MD5() function accepts.

    For reference, see https://docs.snowflake.com/sql-reference/data-type-conversion and https://docs.snowflake.com/en/sql-reference/functions/md5.

    Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?

    There's nothing wrong with the MD5 implementation. This same behavior can be illustrated in a different environment. For example, using Go:

    package main
    
    import (
        "crypto/md5"
        "fmt"
    )
    
    func printMD5(input string) {
        inputBytes := []byte(input)
        fmt.Printf("input = %q\n", input)
        fmt.Printf("inputBytes = %x\n", inputBytes)
        fmt.Printf("md5(inputBytes) = %x\n\n", md5.Sum(inputBytes))
    }
    
    func main() {
        printMD5("5")
        printMD5("5.000000")
    }
    

    Output:

    input = "5"
    inputBytes = 35
    md5(inputBytes) = e4da3b7fbbce2345d7772b0674a318d5
    
    input = "5.000000"
    inputBytes = 352e303030303030
    md5(inputBytes) = 463daaeaa59e931ec54b0d6375a05a0f