Search code examples
sqlsql-serveroptimizationlevenshtein-distance

How can I optimize retrieving lowest edit distance from a large table in SQL?


I'm having troubles optimizing this Levenshtein Distance calculation I'm doing. I need to do the following:

  1. Get the record with the minimum distance for the source string as well as a trimmed version of the source string

  2. Pick the record with the minimum distance
  3. If the min distances are equal (original vs trimmed), choose the trimmed one with the lowest distance
  4. If there are still multiple records that fall under the above two categories, pick the one with the highest frequency

Here's my working version:

DECLARE @Results TABLE
(
    ID int,
    [Name] nvarchar(200), 
    Distance int, 
    Frequency int, 
    Trimmed bit
)


INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@Source, [Name])) As Distance,
           Frequency, 
           'False' As Trimmed
    FROM
           MyTable

INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance,
           Frequency, 
           'True' As Trimmed
    FROM
           MyTable

SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency)

I believe what I need to do here is to..

  1. Not dumb the results to a temporary table
  2. Do only 1 select from `MyTable`
  3. Setting the results right in the select from the initial select statement. (Since select will set variables and you can set multiple variables in one select statement)

I know there has to be a good implementation to this but I can't figure it out... this is as far as I got:

SELECT top 1 @ResultID = ID, 
             @Result = [Name], 
            (dbo.Levenshtein(@Source, [Name])) As distOrig,
             (dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
             Frequency
FROM
    MyTable
WHERE /* ... yeah I'm lost */
ORDER BY distOrig, distTrimmed, Frequency 

Any ideas?


Solution

  • I think your attempt differs from the code that you say works in that the working code orders by distance first, whether or not that is original or trimmed distance. Your attempt orders by original distance first, then trimmed.

    I'm not sure I understand what you're trying to do entirely, but does the following do what you need?

    SELECT TOP 1
        @ResultId = ID,
        @Result = [Name],
        @ResultDist = distOrig,
        @ResultTrimmed = distTrimmed
    FROM (
        SELECT
            ID, [Name], 
            dbo.Levenshtein(@Source, [Name]) As distOrig,
            dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
            Frequency
        FROM MyTable
    ) AS T
    ORDER BY
        CASE WHEN distOrig > distTrimmed THEN distOrig ELSE distTrimmed END, -- Distance
        CASE WHEN distOrig > distTrimmed THEN 1 ELSE 0 END,                  -- Trimmed
        Frequency                                                            -- Frequency