Search code examples
postgresqlfull-text-searchlevenshtein-distance

How to combine PostgreSQL text search with fuzzystrmatch


I'd like to be able to query words from column of type ts_vector but everything which has a levenshtein distance below X should be considered a match.

Something like this where my_table is:

id | my_ts_vector_colum             | sentence_as_text
------------------------------------------------------
1  | 'bananna':3 'tasty':2 'very':1 | Very tasty bananna
2  | 'banaana':2 'yellow':1         | Yellow banaana
3  | 'banana':2 'usual':1           | Usual banana
4  | 'baaaanaaaanaaa':2 'black':1   | Black baaaanaaaanaaa

I want to query something like "Give me id's of all rows, which contain the word banana or words similar to banana where similar means that its Levenshtein distance is less than 4". So the result should be 1, 2 and 3.

I know i can do something like select id from my_table where my_ts_vector_column @@ to_tsquery('banana');, but this would only get me exact matches.

I also know i could do something like select id from my_table where levenshtein(sentence_as_text, 'banana') < 4;, but this would work only on a text column and would work only if the match would contain only the word banana.

But I don't know if or how I could combine the two.

P.S. Table where I want to execute this on contains about 2 million records and the query should be blazing fast (less than 100ms for sure).

P.P.S - I have full control on the table's schema, so changing datatypes, creating new columns, etc would be totally feasible.


Solution

  • 2 million short sentences presumably contains far fewer distinct words than that. But if all your sentences have "creative" spellings, maybe not.

    So you can perhaps create a table of distinct words to search relatively quickly with the unindexed distance function:

    create materialized view words as 
        select distinct unnest(string_to_array(lower(sentence_as_text),' ')) word from my_table;
    

    And create an exact index into the larger table:

    create index on my_table using gin (string_to_array(lower(sentence_as_text),' '));
    

    And then join the together:

    select * from my_table join words 
       ON (ARRAY[word] <@ string_to_array(lower(sentence_as_text),' ')) 
       WHERE levenshtein(word,'banana')<4;