Search code examples
t-sqllevenshtein-distance

Check SQL Server table values against themselves


Imagine I had this table:

declare @tmpResults table ( intItemId int, strTitle nvarchar(100), intWeight float )

insert into @tmpResults values (1, 'Item One', 7)
insert into @tmpResults values (2, 'Item One v1', 6)
insert into @tmpResults values (3, 'Item Two', 6)
insert into @tmpResults values (4, 'Item Two v1', 7)

And a function, which we'll call fn_Lev that takes two strings, compares them to one another and returns the number of differences between them as an integer (i.e. the Levenshtein distance).

What's the most efficient way to query that table, check the fn_Lev value of each strTitle against all the other strTitles in the table and delete rows are similar to one another by a Levenshtein distance of 3, preferring to keeping higher intWeights?

So the after the delete, @tmpResults should contain

1   Item One    7
4   Item Two v1 7

I can think of ways to do this, but nothing that isn't horribly slow (i.e iterative). I'm sure there's a faster way?

Cheers, Matt


Solution

  • SELECT strvalue= CASE 
                    WHEN t1.intweight >= t2.intweight THEN t1.strtitle 
                    ELSE t2.strtitle 
                  END, 
           dist = Fn_lev(t1.strtitle, t2.strtitle) 
    FROM   @tmpResults AS t1 
           INNER JOIN @tmpResults AS t2 
             ON t1.intitemid < t2.intitemid 
    WHERE  Fn_lev(t1.strtitle, t2.strtitle) = 3 
    

    This will perform a self join that will match each row only once. It will excluding matching a row on itself or reverse of a previous match ie if A<->B is a match then B<->A isn't.

    The case statement selects the highest weighted result