Search code examples
sql-serverhashmd5hashbytes

Converting Varbinary to Char and again to Varbinary for Primary key


I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:

SELECT
     CONVERT(VARBINARY(32),
        CONVERT( CHAR(32),
            HASHBYTES('MD5', 
            (LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
            ),
        2)
    ) 
FROM database.schema.table

I find it hard to understand for what is the result from hashbytes function is converted to char and then to varbinary, when we get directly varbinary from hashbytes function. Is there any good reason to do so?


Solution

  • Short Version

    This code pads a hash with 0x20 bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anyway

    Long Version

    Hashes are completely inappropriate for generating primary keys. In fact, since the same hash can be generated from different original data, this code is guaranteed to produce duplicate values, causing collisions at best.

    Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.

    And even worse, the concatenation expression :

    (LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
    

    Will create the same initial string for multiple different column values.

    On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.

    This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.

    Using a GUID generated with NEWID() would provide a key value without collisions. Using NEWSEQUENTIALID() would generate sequential GUID values eliminating fragmentation and once again allowing range searches.

    An even better solution would be to just create a PK from the two columns :

    ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);
    

    Or just add an IDENTITY-generated ID column. A bigint is large enough to handle all scenarios :

    Create ThatTable (
        ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
        ...
    )
    

    If the intention was to ignore spaces in column values there are better options:

    • The easiest solution would be to clean up the values when inserting them.
    • A CHECK constraint can be added to each column to ensure the columns can't have leading or trailing spaces.
    • An INSTEAD OF trigger can be used to trim them.
    • Computed, persisted columns can be added that trim the originals, eg Column1_Cleaned as TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keys

    As for what it does:

    1. It generates deprecation warnings (MD5 is deprecated)
    2. It pads the MD5 hash with 0x20 bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.

    You can check the results by hashing any value. The following queries

    select hashbytes('md5','banana')
    ----------------------------------
    0x72B302BF297A228A75730123EFEF7C41
    
    select cast(hashbytes('md5','banana') as char(32))
    --------------------------------
    r³¿)z"Šus#ïï|A                
    

    A space in ASCII is the byte 0x20. Casting to binary replaces spaces with 0x20, not 0x00

    select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
    ------------------------------------------------------------------
    0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020
    

    If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00. The result is no better than the original though

    select cast(hashbytes('md5','banana') as binary(32))
    ------------------------------------------------------------------
    0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000
    

    To get a real 32-byte hash, SHA2_256 can be used :

    select hashbytes('sha2_256','banana')
    ------------------------------------------------------------------
    0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E