In a generic ETL flow I'm selecting data from a source (Table, file, webservice etc.) into a datamart.
I'm using the MS-SQL hashbyte-function to identify whether a row has changed.
E.g. on a city-table with CountryCode, Zip and CityName with primary key = CountryCode and Zip
SELECT CountryCode, Zip, CityName,
CONVERT(VARCHAR(40), HASHBYTES('MD5',
(SELECT CityName FROM spo.City sub
WHERE sub.Zip = main.Zip
AND sub.CountryCode = main.CountryCode
FOR XML RAW )), 2) AS SysCheckSumSCD1
FROM spo.City main
My issue is in case of duplicate primary keys in the source
Then the subselect used in HASHBYTES, will include both columns and both rows will have the same hashkey. Hence my datamart will not be updated correctly. Furthermore I won't be informed that there are duplicates in the source.
Example result:
Zip CountryCode CityName SysCheckSumSCD1
14600 FR Honfleur 6D8EF511B35621FC0F5CC67AA6B98EEA
14600 FR Equemauville 6D8EF511B35621FC0F5CC67AA6B98EEA
Instead I would like the call to fail.
Previously I used the CHECKSUM function which took an actual row as input and failed in the above example. But I had to change to HASHBYTES which unfortunately only takes a string as input. Which is the reason for the 'FOR XML RAW' part
I would appreciate any useful input. Preferable something that can be implemented just by changing the above SQL-statement, since it's a part of a large generic solution. And my hands are a little bit tied.
I have been thinking about adding a dummy aggregate function to force the error. But haven't been able to figure out how to do it.
After returning to the keyboard after the weekend with a recharged brain. I realize that the solution is actually quite simple.
Just make the statement passed to HASHBYTES a subquery by adding 'SELECT () AS FOO' around it. :-|
So the statement will look like this.
SELECT CountryCode, Zip, CityName,
CONVERT(VARCHAR(40), HASHBYTES('MD5',
(SELECT
(SELECT CityName
FROM spo.City sub
WHERE sub.Zip = main.Zip
AND sub.CountryCode = main.CountryCode) AS FOO
FOR XML RAW )), 2) AS SysCheckSumSCD1
FROM spo.City main