Objective
I want to combine the four columns of a row into one value, named COMBINED_VALUE
.
These strict rules apply:
COMBINED_VALUE
COMBINED_VALUE
INT
, three columns are defined as NVARCHAR(100)
Notes
Among the first things I tried was BINARY_CHECKSUM
. It fails, resulting in the same value for different rows:
SELECT
a, b,
binary_checksum_failing = BINARY_CHECKSUM(a,b)
FROM
(VALUES ('i01', CONVERT(DECIMAL(9,2), 100)),
('i01', CONVERT(DECIMAL(9,2), 10)),
('i01', CONVERT(DECIMAL(9,2), 1)) )v(a, b)
HASHBYTES
with 'SHA2_256' works better, but it needs one well prepared COMBINED_VALUE
as input. If that is ensured, it has a chance to produce "very" unique hashes. I found that concatenating the columns should work but NULLs need to be replaced and columns need to be separated.
Find below two pieces of code showing my current progress and further notes. The first part shows a solution based on conversion to NVARCHAR
, the second part uses VARBINARY
. The example rows show edge cases and try to provoke failure.
Questions
COMBINED_VALUE
while passing validation?COMBINED_VALUE
? Is there a faster way? What is best practise in general?Script #1:
DROP TABLE IF EXISTS #t
CREATE TABLE #t
(
id INT IDENTITY(1,1)
,DESCRIPTION NVARCHAR(100)
,number INT
,word1 NVARCHAR(100)
,word2 NVARCHAR(100)
,word3 NVARCHAR(100)
)
INSERT INTO #t(DESCRIPTION, number, word1, word2, word3)
VALUES
('base row' , 1, 'aa', 'aa', 'aa')
,('trailing space in different places', 1, 'aa', 'aa', 'aa ')
,('trailing space in different places', 1, 'aa', 'aa ', 'aa')
,('NULLs in different places' , 1, NULL, 'aa', 'aa')
,('NULLs in different places' , 1, 'aa', NULL, 'aa')
,('different rows, screwery with NULL', 1, 'aa', 'aa', NULL)
,('different rows, screwery with NULL', 1, 'aa', 'aa', '*NULL*')
,('different rows, screwery with sep.', 1, 'aa' , 'aa/SEP\aa', 'aa')
,('different rows, screwery with sep.', 1, 'aa/SEP\aa', 'aa', 'aa')
,('longer values'
, 1111111111, 'abcdefghijklmn', 'abcdefghijklmn', 'abcdefghijklmn')
DECLARE
@NULL_replacement NVARCHAR(10) = '*NULL*'
,@separator NVARCHAR(10) = '/SEP\'
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f
FROM
#t
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (word1)
,(word2)
,(word3) )v(v)
JOIN (VALUES (@NULL_replacement)
,(@separator ) )w(w)
ON v LIKE '%' + w + '%' COLLATE Czech_BIN2
)validator
OUTER APPLY
(SELECT COMBINED_VALUE = CONCAT_WS(
@separator
,COALESCE(CONVERT(NVARCHAR(100), number), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word1 ), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word2 ), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word3 ), @NULL_replacement))
)combiner
)
SELECT
id
,DESCRIPTION
,number
,word1_encapsulated = '>' + word1 + '<'
,word2_encapsulated = '>' + word2 + '<'
,word3_encapsulated = '>' + word3 + '<'
,COMBINED_VALUE_encapsulated = '>' + COMBINED_VALUE + '<'
,valid_f
,repeats
,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE LIKE i.COMBINED_VALUE COLLATE Czech_BIN2
)repeatfinder
ORDER BY
id
I consider the example above proof that I need to replace NULLs and use a separator. See rows 4 and 5, if these columns get concatenated with CONCAT
, you get the same output. See the column DESCRIPTION
for more details.
NULL replacement and separator
COMBINED_VALUE
Script #2
After the above I came up with this mostly similar query that uses VARBINARY
to make the output smaller, particularly for numeric data. Null replacement and separator are "super rare". The ASCII codes for them are labeled "unused" in the ASCII table.
DECLARE
@separator_bin VARBINARY(5) =
CONVERT(VARBINARY(5), CHAR(129)+CHAR(141)+CHAR(143)+CHAR(144)+CHAR(157))
,@NULL_repl_bin VARBINARY(5) =
CONVERT(VARBINARY(5), CHAR(157)+CHAR(144)+CHAR(143)+CHAR(141)+CHAR(129))
DECLARE
@separator_vc VARCHAR(10) = CONVERT(VARCHAR(10), @separator_bin, 2)
,@NULL_repl_vc VARCHAR(10) = CONVERT(VARCHAR(10), @NULL_repl_bin, 2)
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f, word1_binary
FROM
#t
OUTER APPLY
(SELECT
number_binary = CONVERT(VARBINARY(8000), number)
,word1_binary = CONVERT(VARBINARY(8000), word1 )
,word2_binary = CONVERT(VARBINARY(8000), word2 )
,word3_binary = CONVERT(VARBINARY(8000), word3 )
)binaries
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (CONVERT(VARCHAR(8000), number_binary, 2))
,(CONVERT(VARCHAR(8000), word1_binary , 2))
,(CONVERT(VARCHAR(8000), word2_binary , 2))
,(CONVERT(VARCHAR(8000), word3_binary , 2)) )v(v)
JOIN (VALUES (@separator_vc)
,(@NULL_repl_vc) )w(w)
ON CHARINDEX(w, v COLLATE Latin1_General_BIN2) > 0
)validator
OUTER APPLY
(SELECT COMBINED_VALUE =
COALESCE(number_binary, @NULL_repl_bin)
+ @separator_bin + COALESCE(word1_binary , @NULL_repl_bin)
+ @separator_bin + COALESCE(word2_binary , @NULL_repl_bin)
+ @separator_bin + COALESCE(word3_binary , @NULL_repl_bin)
)combiner
)
SELECT
id
,DESCRIPTION
,number
,word1_encapsulated = '>' + word1 + '<'
,word2_encapsulated = '>' + word2 + '<'
,word3_encapsulated = '>' + word3 + '<'
,COMBINED_VALUE
,valid_f
,repeats
,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
,word1_varchar = CONVERT(VARCHAR(8000), word1_binary, 2)
,separator_varchar = @separator_varchar
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE = i.COMBINED_VALUE
)repeatfinder
ORDER BY
id
I could suggest a binary concatenation that includes a combination of following for each column:
IIF(col IS NULL THEN 0x01, 0x00)
,ISNULL(CAST(DATALENGTH(col) AS VARBINARY(MAX)), 0x)
, andISNULL(CAST(col AS VARBINARY(MAX)), 0x)
.The null indicator and data length parts can be selectively omitted for non-nullable and fixed-length column types, respectively.
This will unambiguously handle any combination of nullable and variable-length columns, and you don't need to worry about delimiters in your data.
The concatenated result can be fed into a good hash function like SHA-256. The chance of a SHA-256 collision is about 1 every 10 billion years, if you add a million unique rows to your database every femtosecond.
SELECT
t.*,
--C.CombinedValue,
H.HashValue
FROM #t t
CROSS APPLY (
SELECT
+ IIF(number IS NULL, 0x01, 0x00)
+ ISNULL(CAST(DATALENGTH(number) AS VARBINARY(MAX)), 0x) -- Optional
+ ISNULL(CAST(number AS VARBINARY(MAX)), 0x)
+ IIF(word1 IS NULL, 0x01, 0x00)
+ ISNULL(CAST(DATALENGTH(word1) AS VARBINARY(MAX)), 0x)
+ ISNULL(CAST(word1 AS VARBINARY(MAX)), 0x)
+ IIF(word2 IS NULL, 0x01, 0x00)
+ ISNULL(CAST(DATALENGTH(word2) AS VARBINARY(MAX)), 0x)
+ ISNULL(CAST(word2 AS VARBINARY(MAX)), 0x)
+ IIF(word3 IS NULL, 0x01, 0x00)
+ ISNULL(CAST(DATALENGTH(word3) AS VARBINARY(MAX)), 0x)
+ ISNULL(CAST(word3 AS VARBINARY(MAX)), 0x)
AS CombinedValue
) C
CROSS APPLY (
SELECT HASHBYTES('SHA2_256',C.CombinedValue) AS HashValue
) H
ORDER BY T.id
The above can be easily rearranged to use a CTE if desired.
Sample results:
id | DESCRIPTION | number | word1 | word2 | word3 | HashValue |
---|---|---|---|---|---|---|
1 | base row | 1 | aa | aa | aa | 0xFB9299EC58E4DAC1465BD3025517F735DC140BFE1CB8EE1F627C610514DF2A2D |
2 | trailing space in different places | 1 | aa | aa | aa | 0x6E462C1EF5F7CD583F74F10594CFF454C6EAD556D42564153C1F9B558ECB3C3A |
3 | trailing space in different places | 1 | aa | aa | aa | 0x862F08AC15AFEF147455038FF9F3AFFEB95826530BFE3DA2C87EBB8AFDAAA572 |
4 | NULLs in different places | 1 | null | aa | aa | 0xBDF6691C3F2207B854251CBC5A527E9A1DAFF7282F68843F97178E47A2A91900 |
5 | NULLs in different places | 1 | aa | null | aa | 0xB53C88CA610E62159C942AF39D61165ECDB12F52E0B65D62A126BB69257AD316 |
6 | different rows, screwery with NULL | 1 | aa | aa | null | 0x86D4A0F8808C4691E121001E199AD69160283768824BE86723AEE191FC5387A3 |
7 | different rows, screwery with NULL | 1 | aa | aa | NULL | 0x27CDC248AEDB67EA3FEE935FB28FFD6F8BCF144D0B177824DAEEA7D5F2C69D2F |
8 | different rows, screwery with sep. | 1 | aa | aa/SEP\aa | aa | 0x3CD630032648215DD6AFACE7098AE0A3DCA13D8180490B99046195F1B84CAE84 |
9 | different rows, screwery with sep. | 1 | aa/SEP\aa | aa | aa | 0x507C17CE4F560775AE6A0973A5BB33A8E30C52160927C7D682914446AC7ADC1E |
10 | longer values | 1111111111 | abcdefghijklmn | abcdefghijklmn | abcdefghijklmn | 0x04DD3801CEBEC074D512AFDA327CDB1F05089E53EE2CD93AC72BC260145A95DC |
See this db<>fiddle for a demo.