Search code examples
t-sqlstring-matching

Matching characters between two strings - TSQL


I have two pieces of code which splits a string into single characters and returns it row by row. Does anyone know of any in-built functions which can essentially take the split strings in order to determine whether they are similar to each other ?


SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split 
FROM 
    (SELECT 'Wes Davids' as Name) AS Aux
    INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'
 1 W
 2 e
 3 s
 4 
 5 D
 6 a
 7 v
 8 i
 9 d
10 s
SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split 
FROM 
    (SELECT 'W Davids' as Name) AS Aux
    INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'

 1 W
 2 
 3 D
 4 a
 5 v
 6 i
 7 d
 8 s

Solution

  • For splitting a string into N-Grams, specifically unigrams in your case, you should use ngrams8k. For example:

    SELECT ng.* FROM dbo.ngrams8k('Wes Davids',1) AS ng;
    

    Returns:

    position    token
    ----------- ---------
    1           W
    2           e
    3           s
    4            
    5           D
    6           a
    7           v
    8           i
    9           d
    10          s
    

    You can use it, for example, to quickly get the longest common substring between two strings as shown below. You can create a similarity score by dividing the length of the longest common substring (LCSS) by the length of the longest of the two strings (L2):

    DECLARE
      @string1 VARCHAR(100) = 'Joe Cook',
      @string2 VARCHAR(100) = 'J Cook';
    
    SELECT TOP (1) *, LCSS = LEN(TRIM(ng.token)), similarity = 1.*LEN(TRIM(ng.token))/b.L2
    FROM (VALUES(
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1      ELSE @string2 END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2      ELSE @string1 END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2)END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1)END
    )) AS b(S1,S2,L1,L2)
    CROSS JOIN  master..spt_values            AS x
    CROSS APPLY dbo.ngrams8k(b.S1,x.number+1) AS ng
    WHERE       x.[type] = 'P' 
    AND         x.number < b.L1
    AND         CHARINDEX(ng.token,b.S2) > 0
    ORDER BY    LEN(TRIM(ng.token)) DESC
    GO
    

    Returns:

    S1             S2           position             token      LCSS  Similarity
    -------------- ------------ -------------------- ---------- ----- ---------------------------------------
    J Cook         Joe Cook     3                    Cook       4     0.50000000000
    

    You can get a better similarity score by subtracting the Levenshtein (lev) distance from the length of the shorter of the two strings (L1-Lev), then dividing that value by L2: (L1-Lev)/L2. You can use Phil Factor's Levenshtein Function for this.

    DECLARE
      @string1 VARCHAR(100) = 'James Cook',
      @string2 VARCHAR(100) = 'Jamess Cook';
    
    SELECT 
      Lev        = dbo.LEVENSHTEIN(@string1,@string2), 
      Similarity = (1.*b.L1-dbo.LEVENSHTEIN(@string1,@string2))/b.L2
    FROM (VALUES(
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1      ELSE @string2 END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2      ELSE @string1 END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2)END,
      CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1)END
    )) AS b(S1,S2,L1,L2)
    GO
    

    Returns:

    Lev         Similarity
    ----------- ---------------------------------------
    1           0.81818181818
    

    This is an example of how to use the Levenshtein Distance for measuring similarity. There are other algorithms such as the Damerau–Levenshtein distance and The Longest Common Subsequence. Damerau–Levenshtein is more precise but slower (Phil Factor has a Damerau–Levenshtein function in the aforementioned link as well as a [Longest Common Subsequence function] in a different post7. The formula for similarity is the same (L1-DLev)/L2. The Longest Common Subsequence (LCSSq) is more accurate (but slower) than the longest common substring, but uses the same formula for calculating a similarity score: (LCSSq/L2)

    Hopefully this will get you started.