sql sql-server t-sql search fuzzy-search

Fuzzy phrase similarity in SQL Server

Using SQL Server, how should I perform a fuzzy ranked search across all rows in a large table for similarity to a long phrase on one column?

In other words, if my data looks like this:

Id	Data
1	the quick brown fox jumps over the lazy dog
2	the quick brown cat jumps over the lazy frog
3	the lazy quick brown frog jumps over the cat
4	lorem ipsum dolor sit amet

And I search for "the quick brown ox jumps over a lazy dog", I want results that roughly resemble:

Id	Score
1	95
2	80
3	40
4	0

Actual data would have more rows and longer phrases.

Obviously I don't want an exact string match, so using LIKE or CONTAINS apparently wouldn't work.

Word order matters, so searching each word individually would not work either.

Full-text indices and sounds-like indices seem like they're only useful for sub-string similarity, so I haven't seen a way to apply that to phrase similarity. For example, how could you query this in a way that gives a decent score to a similar phrase that has missing or added words?

I've tested using edit distance (Lavenshtein, Jaro-Winkler, etc), but it's far too slow over a large set of long strings. One query takes several minutes. It sounds like it should only be used on smaller data, so I think a different kind of approach is needed here.

I've seen TFIDF and cosine similarity mentioned, but I'm not sure whether that's the right thing to use here, or how it might be implemented on SQL Server.

Also, CLR support is limited since we're using SQL Server on Linux. It looks like it's allowed as long as it doesn't require unsafe or external permissions.

Solution

A relative fast method to quickly find the best matching string using fuzzy matching logic can be based on counting matching 3-grams in the strings.

It may utilize pre-built sql functions and indexed table which should speed up the search. In particular it does not have to check distance from search string to every string in the data set.

First, for convenience, create a table function that breaks strings into 3-letter tokens.

drop function dbo.get_triplets;
go
CREATE FUNCTION dbo.get_triplets
(   
    @data varchar(1000)
)
RETURNS TABLE AS RETURN 
(
    WITH Nums AS
    (
      SELECT n = ROW_NUMBER() OVER (ORDER BY [object_id])  FROM sys.all_objects 
    )
    select triplet,count(*) c, len(@data)-2 triplet_count
    from (
        select SUBSTRING(@data,n,3) triplet
        from (select top (len(@data)-2) n from nums) n
    ) triplets
    group by triplet
)
GO

Create a string dataset

drop table if exists #data;
select * into #data
from (
values
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet')
) a(id,data);

Create an indexed table of 3-letter tokens

drop table if exists #triplets;
select id,triplet,c,triplet_count data_triplet_count 
into #triplets
from #data d
cross apply dbo.get_triplets(d.data);

CREATE unique CLUSTERED INDEX IX_triplet_index ON #triplets(triplet,id);

Then I would expect an efficient fuzzy search of a match to a given string using query similar to

declare @string_to_search varchar(1000) = 'the quick brown ox jumps over a lazy dog';

select  matched.*,d.data,
cast(
cast(matched_triplets as float)
/
cast(case when data_triplet_count>string_triplet_count 
          then data_triplet_count 
          else string_triplet_count
          end as float) 
as decimal(4,3)) score 
from (
    select id,sum(case when a.c<b.c then a.c else b.c end) matched_triplets,
    max(a.data_triplet_count) data_triplet_count,
    max(b.triplet_count) string_triplet_count
    from #triplets a
    join dbo.get_triplets(@string_to_search) b
    on a.triplet = b.triplet
    group by id
) matched
join #data d
on d.id = matched.id;