Search code examples
sql-servert-sqlazure-sql-databaselevenshtein-distance

How to make this IsSimilar(varchar,varchar) function more performant


I need to implement a function to look for similar names. It's for the development of a new system that has migrated the data of a previous system. One of the features would be that on account creation, it would try and look for person records that are there with a similar name and propose them. Examples for similar names for "John Johnson" could be:

  • John Johnson
  • Jon Jonsen
  • John & Jane Johnson-Peters
  • Fam. Johnsen
  • J. Johnson

To achieve this I've created some SQL functions and functional wise they work:

  • [dbo].[Levenshtein]: A copy of the top rated answer from this question

  • [dbo].[SplitString]: Which is to split a name string based on '/', '', '&', '+' and '-'

     CREATE FUNCTION [dbo].[SplitString](
         @s nvarchar(4000)
     )
     RETURNS @table TABLE ([value] VARCHAR(4000))
     WITH SCHEMABINDING
     AS
     BEGIN
         DECLARE @repl VARCHAR(4000) = REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-');
         INSERT INTO @table 
         SELECT TRIM(value) FROM STRING_SPLIT(@repl, '-', NULL)  
         WHERE RTRIM(value) <> '';
    
         RETURN
     END
    
  • [dbo].[IsSimilar]: Takes 2 strings, calls the split function and checks if any part of the splits are similar, meaning within Levenshtein distance 5, an initial, or 'Fam'

     CREATE FUNCTION [dbo].[IsSimilar](
         @s nvarchar(4000)
       , @t nvarchar(4000)
     )
     RETURNS BIT
     WITH SCHEMABINDING
     AS
     BEGIN
         DECLARE @sT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
         DECLARE @tT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
         DECLARE @sNrOfWords INT,
                 @tNrOfWords INT,
                 @sCount INT = 1,
                 @tCount INT = 1,
                 @sVal VARCHAR(4000),
                 @tVal VARCHAR(4000)
    
         IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
             return 1
    
         INSERT INTO @sT SELECT [value] FROM [dbo].[SplitString](@s)
         INSERT INTO @tT SELECT [value] FROM [dbo].[SplitString](@t)
    
         SET @sNrOfWords = (SELECT COUNT([value]) FROM @sT)
         SET @tNrOfWords = (SELECT COUNT([value]) FROM @tT)
    
         IF (@sNrOfWords > 0 AND @tNrOfWords > 0)
         BEGIN
             WHILE (@sCount <= (SELECT MAX(idx) FROM @sT))
             BEGIN           
                 SET @sVal = (SELECT [value] FROM @sT WHERE idx = @sCount)
                 WHILE (@tCount <= (SELECT MAX(idx) FROM @tT))
                 BEGIN
                     SET @tVal = (SELECT [value] FROM @tT WHERE idx = @tCount)
                     IF (((LEN(@sVal) = 1 OR LEN(@tVal) = 1 OR SUBSTRING(@sVal, 2, 1) IN ('.', ' ') OR SUBSTRING(@tVal, 2, 1) IN ('.', ' ')) AND SUBSTRING(@sVal, 1, 1) = SUBSTRING(@tVal, 1, 1)) OR ((SELECT [dbo].[Levenshtein](@sVal,@tVal, 5)) IS NOT NULL))
                         RETURN 1
                     SET @tCount = @tCount + 1
                 END
                 SET @sCount = @sCount + 1
                 SET @tCount = 1
             END
         END
         RETURN 0
     END
    

Sadly this solution isn't the most performant :( enter image description here It takes 12-13s to find this record based on my mispelled name, which off course is too long. The full table is only 512 records at the moment.

Any help on getting this more performant? I know looping isn't recomended in SQL, so probably something to gain there. I'm not a DBA or SQL specialist, no idea how to write that differently without the loops. Didn't think I could use a join as there's no equality.


Solution

  • After implementing the suggestions in the comments on the OP, I managed to cut down the same SELECT statement from 12-13s to about 1s, which is a lot more acceptable.

    The SplitString has been changed to an inline function:

    Create FUNCTION [dbo].[SplitString](
        @s nvarchar(4000)
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN(
        SELECT TRIM(value) AS value FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-'), '-', NULL)  
        WHERE RTRIM(value) <> ''
    );
    

    Cutting down on the variables and using a Join statement for the IsSimilar function gives a boost as well:

    CREATE FUNCTION [dbo].[IsSimilar](
        @s nvarchar(4000)
        , @t nvarchar(4000)
    )
    RETURNS BIT
    AS
    BEGIN
    
        IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
            return 1
    
        IF (LEN(TRIM(@s)) > 0 AND LEN(TRIM(@t)) > 0)
        BEGIN
            RETURN (SELECT IIF (EXISTS(SELECT [sT].[value] FROM (SELECT [value] FROM [dbo].[SplitString](@s)) AS sT INNER JOIN (SELECT [value] FROM [dbo].[SplitString](@t)) AS tT ON (((LEN([sT].[value]) = 1 OR LEN([tT].[value]) = 1 OR SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND SUBSTRING([sT].[value], 1, 1) = SUBSTRING([tT].[value], 1, 1)) OR (NOT(SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND (SELECT [dbo].[Levenshtein]([sT].[value],[tT].[value], 5)) IS NOT NULL)) ), 1, 0))
        END
        RETURN 0
    END
    

    I don't know how much this boost will hold up to real big data, but in this case that'll not be the case as Person records get linked to Account records with every new account creation and only Person records with AccountID = NULL will be considered.