Search code examples
sql-serveretlhashbytes

Avoiding multiple rows in call to 'HashBytes' with 'FOR XML RAW'


In a generic ETL flow I'm selecting data from a source (Table, file, webservice etc.) into a datamart.

I'm using the MS-SQL hashbyte-function to identify whether a row has changed.

E.g. on a city-table with CountryCode, Zip and CityName with primary key = CountryCode and Zip

 SELECT CountryCode, Zip,  CityName, 
 CONVERT(VARCHAR(40), HASHBYTES('MD5', 
 (SELECT CityName FROM spo.City sub 
  WHERE sub.Zip = main.Zip  
    AND  sub.CountryCode = main.CountryCode  
    FOR XML RAW )), 2) AS SysCheckSumSCD1 
FROM spo.City main 

My issue is in case of duplicate primary keys in the source

Then the subselect used in HASHBYTES, will include both columns and both rows will have the same hashkey. Hence my datamart will not be updated correctly. Furthermore I won't be informed that there are duplicates in the source.

Example result:

Zip     CountryCode     CityName        SysCheckSumSCD1
14600   FR              Honfleur        6D8EF511B35621FC0F5CC67AA6B98EEA
14600   FR              Equemauville    6D8EF511B35621FC0F5CC67AA6B98EEA

Instead I would like the call to fail.

Previously I used the CHECKSUM function which took an actual row as input and failed in the above example. But I had to change to HASHBYTES which unfortunately only takes a string as input. Which is the reason for the 'FOR XML RAW' part

I would appreciate any useful input. Preferable something that can be implemented just by changing the above SQL-statement, since it's a part of a large generic solution. And my hands are a little bit tied.

I have been thinking about adding a dummy aggregate function to force the error. But haven't been able to figure out how to do it.


Solution

  • After returning to the keyboard after the weekend with a recharged brain. I realize that the solution is actually quite simple.

    Just make the statement passed to HASHBYTES a subquery by adding 'SELECT () AS FOO' around it. :-|

    So the statement will look like this.

    SELECT CountryCode, Zip,  CityName, 
    CONVERT(VARCHAR(40), HASHBYTES('MD5', 
         (SELECT 
          (SELECT CityName 
          FROM spo.City sub 
          WHERE sub.Zip = main.Zip 
          AND  sub.CountryCode = main.CountryCode) AS FOO 
         FOR XML RAW )), 2) AS SysCheckSumSCD1 
    FROM spo.City main