Search code examples
sql-servert-sql

Combine table columns into one unique value (as input for HASHBYTES)


Objective

I want to combine the four columns of a row into one value, named COMBINED_VALUE.

These strict rules apply:

  • Two rows with the same column values in the same order must produce the same COMBINED_VALUE
  • No two different rows may produce the same COMBINED_VALUE
  • Order of columns makes a difference, trailing spaces make a difference, invisible characters make a difference, any difference makes a difference
  • All columns are nullable, one columns is defined as INT, three columns are defined as NVARCHAR(100)

Notes

Among the first things I tried was BINARY_CHECKSUM. It fails, resulting in the same value for different rows:

SELECT
    a, b,
    binary_checksum_failing = BINARY_CHECKSUM(a,b)
FROM
    (VALUES ('i01', CONVERT(DECIMAL(9,2), 100)),
            ('i01', CONVERT(DECIMAL(9,2), 10)),
            ('i01', CONVERT(DECIMAL(9,2),  1)) )v(a, b)

HASHBYTES with 'SHA2_256' works better, but it needs one well prepared COMBINED_VALUE as input. If that is ensured, it has a chance to produce "very" unique hashes. I found that concatenating the columns should work but NULLs need to be replaced and columns need to be separated.

Find below two pieces of code showing my current progress and further notes. The first part shows a solution based on conversion to NVARCHAR, the second part uses VARBINARY. The example rows show edge cases and try to provoke failure.

Questions

  • Can you make the presented scripts fail? Can you create two different rows, that produce the same COMBINED_VALUE while passing validation?
  • Is there a better way to get COMBINED_VALUE? Is there a faster way? What is best practise in general?

Script #1:

DROP TABLE IF EXISTS #t

CREATE TABLE #t
(
    id           INT IDENTITY(1,1)
    ,DESCRIPTION NVARCHAR(100)
    ,number      INT
    ,word1       NVARCHAR(100)
    ,word2       NVARCHAR(100)
    ,word3       NVARCHAR(100)
)

INSERT INTO #t(DESCRIPTION, number, word1, word2, word3)
VALUES
 ('base row'                          , 1,        'aa',        'aa',     'aa') 
,('trailing space in different places', 1,        'aa',        'aa',    'aa ') 
,('trailing space in different places', 1,        'aa',       'aa ',     'aa') 
,('NULLs in different places'         , 1,       NULL,         'aa',     'aa') 
,('NULLs in different places'         , 1,        'aa',        NULL,     'aa') 
,('different rows, screwery with NULL', 1,        'aa',        'aa',     NULL) 
,('different rows, screwery with NULL', 1,        'aa',        'aa', '*NULL*') 
,('different rows, screwery with sep.', 1, 'aa'       , 'aa/SEP\aa',     'aa') 
,('different rows, screwery with sep.', 1, 'aa/SEP\aa',        'aa',     'aa') 
,('longer values'
  , 1111111111, 'abcdefghijklmn', 'abcdefghijklmn', 'abcdefghijklmn') 


DECLARE 
    @NULL_replacement NVARCHAR(10) = '*NULL*'
    ,@separator       NVARCHAR(10) = '/SEP\'

;WITH cte AS
(
    SELECT
        id, DESCRIPTION, number,
        word1, word2, word3,
        COMBINED_VALUE,
        valid_f
    FROM 
        #t
    OUTER APPLY
        (SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
         FROM (VALUES  (word1)
                      ,(word2)
                      ,(word3) )v(v)
         JOIN (VALUES  (@NULL_replacement)
                      ,(@separator       ) )w(w)
           ON v LIKE '%' + w + '%' COLLATE Czech_BIN2 
        )validator
    OUTER APPLY 
        (SELECT COMBINED_VALUE = CONCAT_WS(
              @separator
              ,COALESCE(CONVERT(NVARCHAR(100), number), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word1 ), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word2 ), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word3 ), @NULL_replacement)) 
        )combiner
)
SELECT
    id
    ,DESCRIPTION
    ,number
    ,word1_encapsulated          = '>' + word1 + '<'
    ,word2_encapsulated          = '>' + word2 + '<'
    ,word3_encapsulated          = '>' + word3 + '<'
    ,COMBINED_VALUE_encapsulated = '>' + COMBINED_VALUE + '<'
    ,valid_f
    ,repeats
    ,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
FROM
    cte o
OUTER APPLY
    (SELECT repeats = COUNT(*) 
     FROM cte i
     WHERE o.COMBINED_VALUE LIKE i.COMBINED_VALUE COLLATE Czech_BIN2 
    )repeatfinder
ORDER BY 
    id

I consider the example above proof that I need to replace NULLs and use a separator. See rows 4 and 5, if these columns get concatenated with CONCAT, you get the same output. See the column DESCRIPTION for more details.

NULL replacement and separator

  • Prevent different rows forming the same COMBINED_VALUE
  • Must be disallowed as part of input. This must be verified, input must be sanitized
  • Should very unlikely be part of input, should be "rare"

Script #2

After the above I came up with this mostly similar query that uses VARBINARY to make the output smaller, particularly for numeric data. Null replacement and separator are "super rare". The ASCII codes for them are labeled "unused" in the ASCII table.

DECLARE 
    @separator_bin VARBINARY(5) = 
      CONVERT(VARBINARY(5), CHAR(129)+CHAR(141)+CHAR(143)+CHAR(144)+CHAR(157))
    ,@NULL_repl_bin VARBINARY(5) = 
      CONVERT(VARBINARY(5), CHAR(157)+CHAR(144)+CHAR(143)+CHAR(141)+CHAR(129))

DECLARE 
    @separator_vc  VARCHAR(10) = CONVERT(VARCHAR(10), @separator_bin, 2)
    ,@NULL_repl_vc VARCHAR(10) = CONVERT(VARCHAR(10), @NULL_repl_bin, 2)

;WITH cte AS
(
    SELECT
        id, DESCRIPTION, number,
        word1, word2, word3,
        COMBINED_VALUE, 
        valid_f, word1_binary
    FROM 
        #t
    OUTER APPLY
        (SELECT 
             number_binary = CONVERT(VARBINARY(8000), number)
             ,word1_binary = CONVERT(VARBINARY(8000), word1 )
             ,word2_binary = CONVERT(VARBINARY(8000), word2 )
             ,word3_binary = CONVERT(VARBINARY(8000), word3 )
        )binaries
    OUTER APPLY
        (SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
         FROM (VALUES  (CONVERT(VARCHAR(8000), number_binary, 2))
                      ,(CONVERT(VARCHAR(8000), word1_binary , 2))
                      ,(CONVERT(VARCHAR(8000), word2_binary , 2))
                      ,(CONVERT(VARCHAR(8000), word3_binary , 2)) )v(v)
         JOIN (VALUES  (@separator_vc)
                      ,(@NULL_repl_vc) )w(w) 
           ON CHARINDEX(w, v COLLATE Latin1_General_BIN2) > 0
        )validator
    OUTER APPLY
        (SELECT COMBINED_VALUE = 
                            COALESCE(number_binary, @NULL_repl_bin) 
         + @separator_bin + COALESCE(word1_binary , @NULL_repl_bin) 
         + @separator_bin + COALESCE(word2_binary , @NULL_repl_bin) 
         + @separator_bin + COALESCE(word3_binary , @NULL_repl_bin) 
        )combiner
    )
    SELECT
        id
        ,DESCRIPTION
        ,number
        ,word1_encapsulated = '>' + word1 + '<'
        ,word2_encapsulated = '>' + word2 + '<'
        ,word3_encapsulated = '>' + word3 + '<'
        ,COMBINED_VALUE
        ,valid_f
        ,repeats
        ,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
        ,word1_varchar       = CONVERT(VARCHAR(8000), word1_binary, 2)
        ,separator_varchar   = @separator_varchar
    FROM 
        cte o
    OUTER APPLY
        (SELECT repeats = COUNT(*) 
         FROM cte i
         WHERE o.COMBINED_VALUE = i.COMBINED_VALUE
        )repeatfinder
    ORDER BY 
        id

Solution

  • I could suggest a binary concatenation that includes a combination of following for each column:

    1. A null indicator IIF(col IS NULL THEN 0x01, 0x00),
    2. The data length ISNULL(CAST(DATALENGTH(col) AS VARBINARY(MAX)), 0x), and
    3. The value itself ISNULL(CAST(col AS VARBINARY(MAX)), 0x).

    The null indicator and data length parts can be selectively omitted for non-nullable and fixed-length column types, respectively.

    This will unambiguously handle any combination of nullable and variable-length columns, and you don't need to worry about delimiters in your data.

    The concatenated result can be fed into a good hash function like SHA-256. The chance of a SHA-256 collision is about 1 every 10 billion years, if you add a million unique rows to your database every femtosecond.

    SELECT
        t.*,
        --C.CombinedValue,
        H.HashValue
    FROM #t t
    CROSS APPLY (
        SELECT 
            + IIF(number IS NULL, 0x01, 0x00)
              + ISNULL(CAST(DATALENGTH(number) AS VARBINARY(MAX)), 0x) -- Optional
              + ISNULL(CAST(number AS VARBINARY(MAX)), 0x)
            + IIF(word1 IS NULL, 0x01, 0x00)
              + ISNULL(CAST(DATALENGTH(word1) AS VARBINARY(MAX)), 0x)
              + ISNULL(CAST(word1 AS VARBINARY(MAX)), 0x)
            + IIF(word2 IS NULL, 0x01, 0x00)
              + ISNULL(CAST(DATALENGTH(word2) AS VARBINARY(MAX)), 0x)
              + ISNULL(CAST(word2 AS VARBINARY(MAX)), 0x)
            + IIF(word3 IS NULL, 0x01, 0x00)
              + ISNULL(CAST(DATALENGTH(word3) AS VARBINARY(MAX)), 0x)
              + ISNULL(CAST(word3 AS VARBINARY(MAX)), 0x)
            AS CombinedValue
    ) C
    CROSS APPLY (
        SELECT HASHBYTES('SHA2_256',C.CombinedValue) AS HashValue
    ) H
    ORDER BY T.id
    

    The above can be easily rearranged to use a CTE if desired.

    Sample results:

    id DESCRIPTION number word1 word2 word3 HashValue
    1 base row 1 aa aa aa 0xFB9299EC58E4DAC1465BD3025517F735DC140BFE1CB8EE1F627C610514DF2A2D
    2 trailing space in different places 1 aa aa aa 0x6E462C1EF5F7CD583F74F10594CFF454C6EAD556D42564153C1F9B558ECB3C3A
    3 trailing space in different places 1 aa aa aa 0x862F08AC15AFEF147455038FF9F3AFFEB95826530BFE3DA2C87EBB8AFDAAA572
    4 NULLs in different places 1 null aa aa 0xBDF6691C3F2207B854251CBC5A527E9A1DAFF7282F68843F97178E47A2A91900
    5 NULLs in different places 1 aa null aa 0xB53C88CA610E62159C942AF39D61165ECDB12F52E0B65D62A126BB69257AD316
    6 different rows, screwery with NULL 1 aa aa null 0x86D4A0F8808C4691E121001E199AD69160283768824BE86723AEE191FC5387A3
    7 different rows, screwery with NULL 1 aa aa NULL 0x27CDC248AEDB67EA3FEE935FB28FFD6F8BCF144D0B177824DAEEA7D5F2C69D2F
    8 different rows, screwery with sep. 1 aa aa/SEP\aa aa 0x3CD630032648215DD6AFACE7098AE0A3DCA13D8180490B99046195F1B84CAE84
    9 different rows, screwery with sep. 1 aa/SEP\aa aa aa 0x507C17CE4F560775AE6A0973A5BB33A8E30C52160927C7D682914446AC7ADC1E
    10 longer values 1111111111 abcdefghijklmn abcdefghijklmn abcdefghijklmn 0x04DD3801CEBEC074D512AFDA327CDB1F05089E53EE2CD93AC72BC260145A95DC

    See this db<>fiddle for a demo.