Search code examples
sql-servert-sqllinked-serversoundex

TSQL Using soundex across linked servers


(Object names in this post have been changed to protect the idiocent).

I was matching lists of employee names where one list was on Server_A

  • column collation:Latin1_General_CI_AS
  • column data type: VARCHAR(50)
  • Column nullability NULL
  • SQL Server version 12.0.4100.1
  • Compatibility: 120

...and the other list was on Server_B

  • column collation:Latin1_General_CI_AS
  • column data type: VARCHAR(30)
  • Column nullability NOT NULL
  • SQL Server version 10.50.4000.0
  • Compatibility: 100

In certain circumstances values that were identical were returning different SOUNDEX values. So effectively saying no match existed where a match did indeed exist. In these cases typically (but not always) there was a level of complexity of the name. So print soundex('Czosgro') (not a real name) returns 'C226' on one server and 'C260' on the other. Arrgh! So this (sample) query returns rows where I wasn't not expecting it to:

WITH srv2 as 
(
SELECT DISTINCT psn.surname
,soundex(psn.surname) as sx
FROM blah.personnel psn --synonym to linked server table
)
, srv1 as 
(
SELECT DISTINCT wrk.Surname
,soundex(wrk.Surname) as sx
FROM hr.Workers                         wrk
)
SELECT *
FROM srv2
JOIN srv1 ON rap.surname = srv1.Surname --Same name
WHERE srv2.sx <> srv1.sx                --Different soundex
;

I was able to solve the problem by selecting the data from srv2 into a temp table on the local server. Running the following shows the problem is no longer evident:

IF OBJECT_ID('tempdb..#srv2') IS NOT NULL DROP TABLE #srv2;
SELECT DISTINCT psn.surname 
INTO #srv2
FROM blah.personnel psn --synonym to linked server table
;

WITH srv2 as 
(
SELECT DISTINCT psn.surname
,soundex(psn.surname) as sx
FROM #srv2 psn
)
, srv1 as 
(
SELECT DISTINCT wrk.Surname
,soundex(wrk.Surname) as sx
FROM hr.Workers                         wrk
)
SELECT *
FROM srv2
JOIN srv1 ON srv2.surname = srv1.Surname
WHERE srv2.sx <> srv1.sx
; 

My question is this: Is this problem caused by the process that decides on the best plan assuming that functions like SOUNDEX() across different versions of SQL Server are all consistent? It seems as though it simply pushes the query unaltered, back to the remote server, which applies it's version of SOUNDEX(). This, it seems, is not guaranteed to match. Supplementary question: Could this reasoning be extrapolated to all functions?


Solution

  • It does indeed seem that SOUNDEX() on different versions of SQL Server behaves differently for a small group of "sounds" (such as "cz" sounds). The solution I've described above seems workable. Users should be aware that cross server comparisons using the SOUNDEX() functions on each instance may not yeild consistent results.