I need to implement a function to look for similar names. It's for the development of a new system that has migrated the data of a previous system. One of the features would be that on account creation, it would try and look for person records that are there with a similar name and propose them. Examples for similar names for "John Johnson" could be:
To achieve this I've created some SQL functions and functional wise they work:
[dbo].[Levenshtein]: A copy of the top rated answer from this question
[dbo].[SplitString]: Which is to split a name string based on '/', '', '&', '+' and '-'
CREATE FUNCTION [dbo].[SplitString](
@s nvarchar(4000)
)
RETURNS @table TABLE ([value] VARCHAR(4000))
WITH SCHEMABINDING
AS
BEGIN
DECLARE @repl VARCHAR(4000) = REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-');
INSERT INTO @table
SELECT TRIM(value) FROM STRING_SPLIT(@repl, '-', NULL)
WHERE RTRIM(value) <> '';
RETURN
END
[dbo].[IsSimilar]: Takes 2 strings, calls the split function and checks if any part of the splits are similar, meaning within Levenshtein distance 5, an initial, or 'Fam'
CREATE FUNCTION [dbo].[IsSimilar](
@s nvarchar(4000)
, @t nvarchar(4000)
)
RETURNS BIT
WITH SCHEMABINDING
AS
BEGIN
DECLARE @sT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
DECLARE @tT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
DECLARE @sNrOfWords INT,
@tNrOfWords INT,
@sCount INT = 1,
@tCount INT = 1,
@sVal VARCHAR(4000),
@tVal VARCHAR(4000)
IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
return 1
INSERT INTO @sT SELECT [value] FROM [dbo].[SplitString](@s)
INSERT INTO @tT SELECT [value] FROM [dbo].[SplitString](@t)
SET @sNrOfWords = (SELECT COUNT([value]) FROM @sT)
SET @tNrOfWords = (SELECT COUNT([value]) FROM @tT)
IF (@sNrOfWords > 0 AND @tNrOfWords > 0)
BEGIN
WHILE (@sCount <= (SELECT MAX(idx) FROM @sT))
BEGIN
SET @sVal = (SELECT [value] FROM @sT WHERE idx = @sCount)
WHILE (@tCount <= (SELECT MAX(idx) FROM @tT))
BEGIN
SET @tVal = (SELECT [value] FROM @tT WHERE idx = @tCount)
IF (((LEN(@sVal) = 1 OR LEN(@tVal) = 1 OR SUBSTRING(@sVal, 2, 1) IN ('.', ' ') OR SUBSTRING(@tVal, 2, 1) IN ('.', ' ')) AND SUBSTRING(@sVal, 1, 1) = SUBSTRING(@tVal, 1, 1)) OR ((SELECT [dbo].[Levenshtein](@sVal,@tVal, 5)) IS NOT NULL))
RETURN 1
SET @tCount = @tCount + 1
END
SET @sCount = @sCount + 1
SET @tCount = 1
END
END
RETURN 0
END
Sadly this solution isn't the most performant :( It takes 12-13s to find this record based on my mispelled name, which off course is too long. The full table is only 512 records at the moment.
Any help on getting this more performant? I know looping isn't recomended in SQL, so probably something to gain there. I'm not a DBA or SQL specialist, no idea how to write that differently without the loops. Didn't think I could use a join as there's no equality.
After implementing the suggestions in the comments on the OP, I managed to cut down the same SELECT statement from 12-13s to about 1s, which is a lot more acceptable.
The SplitString has been changed to an inline function:
Create FUNCTION [dbo].[SplitString](
@s nvarchar(4000)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN(
SELECT TRIM(value) AS value FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-'), '-', NULL)
WHERE RTRIM(value) <> ''
);
Cutting down on the variables and using a Join statement for the IsSimilar function gives a boost as well:
CREATE FUNCTION [dbo].[IsSimilar](
@s nvarchar(4000)
, @t nvarchar(4000)
)
RETURNS BIT
AS
BEGIN
IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
return 1
IF (LEN(TRIM(@s)) > 0 AND LEN(TRIM(@t)) > 0)
BEGIN
RETURN (SELECT IIF (EXISTS(SELECT [sT].[value] FROM (SELECT [value] FROM [dbo].[SplitString](@s)) AS sT INNER JOIN (SELECT [value] FROM [dbo].[SplitString](@t)) AS tT ON (((LEN([sT].[value]) = 1 OR LEN([tT].[value]) = 1 OR SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND SUBSTRING([sT].[value], 1, 1) = SUBSTRING([tT].[value], 1, 1)) OR (NOT(SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND (SELECT [dbo].[Levenshtein]([sT].[value],[tT].[value], 5)) IS NOT NULL)) ), 1, 0))
END
RETURN 0
END
I don't know how much this boost will hold up to real big data, but in this case that'll not be the case as Person records get linked to Account records with every new account creation and only Person records with AccountID = NULL will be considered.