Search code examples
sqlt-sqlfuzzy-searchfuzzy-logicfuzzy-comparison

SQL Fuzzy Join - MSSQL


I have two sets of data. Existing customers and potential customers.

My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are inconsistent.

EXISTING CUSTOMERS

Customer /  ID
Ed's Barbershop /   1002
GroceryTown /   1003
Candy Place /   1004
Handy Man / 1005

POTENTIAL CUSTOMERS

Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes

I would like to write some type of select statement like below to reach my objective:

SELECT a.Customer, b.ID
FROM PotentialCustomers a LEFT JOIN
     ExistingCustomers B
     ON a.Customer = b.Customer

The results would look something like:

Customer /  ID
Eds Barbershop  / 1002
Grocery Town    / 1003
Candy Place / 1004
Handee Man  / 1005
Beauty Salon /  NULL
The Apple Farm /    NULL
Igloo Ice Cream / NULL
Ride-a-Long Bikes / NULL

I am vaguely familiar with the concepts of Levenshtein Distance and Double Metaphone but I am not sure how to apply it here.

Ideally I would want the JOIN portion of the SELECT statement to read something like: LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer but I know that syntax is incorrect.

Any suggestions are welcomed. Thank you!


Solution

  • Here is how this could be done using Levenshtein Distance:

    Create this function:(Execute this first)

    CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
    RETURNS int
    AS
    BEGIN
     DECLARE @s1_len int, @s2_len int
     DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
     DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
    
     SELECT
      @s1_len = LEN(@s1),
      @s2_len = LEN(@s2),
      @cv1 = 0x0000,
      @j = 1, @i = 1, @c = 0
    
     WHILE @j <= @s2_len
      SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
    
     WHILE @i <= @s1_len
     BEGIN
      SELECT
       @s1_char = SUBSTRING(@s1, @i, 1),
       @c = @i,
       @cv0 = CAST(@i AS binary(2)),
       @j = 1
    
      WHILE @j <= @s2_len
      BEGIN
       SET @c = @c + 1
       SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
       IF @c > @c_temp SET @c = @c_temp
       SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
       IF @c > @c_temp SET @c = @c_temp
       SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
     END
    
     SELECT @cv1 = @cv0, @i = @i + 1
     END
    
     RETURN @c
    END
    

    (Function developped by Joseph Gama)

    And then simply use this query to get matches

    SELECT A.Customer,
           b.ID,
           b.Customer
    FROM #POTENTIALCUSTOMERS a
         LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;
    

    Complete Script after you create that function:

    IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
        DROP TABLE #ExistingCustomers;
    
    CREATE TABLE #ExistingCustomers
    (Customer VARCHAR(255),
     ID       INT
    );
    
    INSERT INTO #ExistingCustomers
    VALUES
    ('Ed''s Barbershop',
     1002
    );
    
    INSERT INTO #ExistingCustomers
    VALUES
    ('GroceryTown',
     1003
    );
    
    INSERT INTO #ExistingCustomers
    VALUES
    ('Candy Place',
     1004
    );
    
    INSERT INTO #ExistingCustomers
    VALUES
    ('Handy Man',
     1005
    );
    
    IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
        DROP TABLE #POTENTIALCUSTOMERS;
    
    CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Eds Barbershop');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Grocery Town');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Candy Place');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Handee Man');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Beauty Salon');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('The Apple Farm');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Igloo Ice Cream');
    
    INSERT INTO #POTENTIALCUSTOMERS
    VALUES('Ride-a-Long Bikes');
    
    SELECT A.Customer,
           b.ID,
           b.Customer
    FROM #POTENTIALCUSTOMERS a
         LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;
    

    Here you can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx