Search code examples
sql-serverhashbytes

SQL Server 2014: HASHBYTES returning different value for same string


I am working with a CSV file of addresses that get uploaded to an FTP site on a daily basis. The address records consist of Address_Line1, Address_Line2, City, State, Zip_Code, and Country. There are multiple duplicate addresses in the CSV. My task is to compare all of the addresses in the CSV file to an existing address dimension, then only load new addresses. To do this, I am first loading from the CSV into a staging table, then running the following query to generate a hash in order to do the comparison:

UPDATE STG_ADDRESS
SET ADDRESS_HASH = HASHBYTES(
    'SHA1'
    ,ISNULL(ADDRESS_LINE1, 'N/A') + 
    ISNULL(ADDRESS_LINE2, 'N/A') +
    ISNULL(CITY, 'N/A') +
    ISNULL(STATE, 'N/A') +
    ISNULL(ZIP_CODE, 'N/A') + 
    ISNULL(COUNTRY, 'N/A'));

This is working fine, with one exception. The HASHBYTES function is generating multiple hashes for the same exact address. For today's upload, I ran the following query and got 37 distinct addresses:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')

FROM STG_ADDRESS

After updating with the Hash, I ran the following query and got 43 records:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')
    ,ADDRESS_HASH

FROM STG_ADDRESS

I double checked this with the following query:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')
    ,COUNT(ADDRESS_HASH)

FROM STG_ADDRESS

GROUP BY
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')

HAVING COUNT(ADDRESS_HASH) > 1

And saw that there are six addresses that SQL server considers the same when running a SELECT DISTINCT, but considers different somehow when creating the hash.

Is there any scenario where the same string could result in a different hash being created? If so, what can be done to correct the issue?


Solution

  • Whitespace and case should be normalized before calling HASHBYTES, because it is always case-sensitive. By default, string comparisons during normal SQL operations is case-insensitive (you can modify this with the COLLATION server setting).

    LTRIM(RTRIM(TOLOWER(@value)))