I have two pieces of code which splits a string into single characters and returns it row by row. Does anyone know of any in-built functions which can essentially take the split strings in order to determine whether they are similar to each other ?
SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split
FROM
(SELECT 'Wes Davids' as Name) AS Aux
INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'
1 W
2 e
3 s
4
5 D
6 a
7 v
8 i
9 d
10 s
SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split
FROM
(SELECT 'W Davids' as Name) AS Aux
INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'
1 W
2
3 D
4 a
5 v
6 i
7 d
8 s
For splitting a string into N-Grams, specifically unigrams in your case, you should use ngrams8k. For example:
SELECT ng.* FROM dbo.ngrams8k('Wes Davids',1) AS ng;
Returns:
position token
----------- ---------
1 W
2 e
3 s
4
5 D
6 a
7 v
8 i
9 d
10 s
You can use it, for example, to quickly get the longest common substring between two strings as shown below. You can create a similarity score by dividing the length of the longest common substring (LCSS) by the length of the longest of the two strings (L2):
DECLARE
@string1 VARCHAR(100) = 'Joe Cook',
@string2 VARCHAR(100) = 'J Cook';
SELECT TOP (1) *, LCSS = LEN(TRIM(ng.token)), similarity = 1.*LEN(TRIM(ng.token))/b.L2
FROM (VALUES(
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1 ELSE @string2 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2 ELSE @string1 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2)END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1)END
)) AS b(S1,S2,L1,L2)
CROSS JOIN master..spt_values AS x
CROSS APPLY dbo.ngrams8k(b.S1,x.number+1) AS ng
WHERE x.[type] = 'P'
AND x.number < b.L1
AND CHARINDEX(ng.token,b.S2) > 0
ORDER BY LEN(TRIM(ng.token)) DESC
GO
Returns:
S1 S2 position token LCSS Similarity
-------------- ------------ -------------------- ---------- ----- ---------------------------------------
J Cook Joe Cook 3 Cook 4 0.50000000000
You can get a better similarity score by subtracting the Levenshtein (lev) distance from the length of the shorter of the two strings (L1-Lev), then dividing that value by L2: (L1-Lev)/L2. You can use Phil Factor's Levenshtein Function for this.
DECLARE
@string1 VARCHAR(100) = 'James Cook',
@string2 VARCHAR(100) = 'Jamess Cook';
SELECT
Lev = dbo.LEVENSHTEIN(@string1,@string2),
Similarity = (1.*b.L1-dbo.LEVENSHTEIN(@string1,@string2))/b.L2
FROM (VALUES(
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1 ELSE @string2 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2 ELSE @string1 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2)END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1)END
)) AS b(S1,S2,L1,L2)
GO
Returns:
Lev Similarity
----------- ---------------------------------------
1 0.81818181818
This is an example of how to use the Levenshtein Distance for measuring similarity. There are other algorithms such as the Damerau–Levenshtein distance and The Longest Common Subsequence. Damerau–Levenshtein is more precise but slower (Phil Factor has a Damerau–Levenshtein function in the aforementioned link as well as a [Longest Common Subsequence function] in a different post7. The formula for similarity is the same (L1-DLev)/L2
. The Longest Common Subsequence (LCSSq) is more accurate (but slower) than the longest common substring, but uses the same formula for calculating a similarity score: (LCSSq/L2)
Hopefully this will get you started.