Search code examples
sqlpostgresqlfull-text-searchsimilarity

How to find similar words in Full text search on postgresql?


I'm trying to use Full text search on postgresql:

select *
from entertainement
where to_tsvector('simple', name) @@ to_tsquery('simple', 'word_to_search:*')

This query works well and give me what I want to display.However I found on some websites when I enter a word that is not found it shows me No result found for 'word_to_search' and give me some other propositions to some words similar to it. For example if I put the word activityng I got

No result found for activityng

But it gives me some propositions containing word activity. However when I put the word activityns I got :

No result found for activityns

But I got some propositions containing the word activities. I didn't understand the logic of it because I thnik activityns is similiar to activity than to activities. I tried to know the similiarity of this word using similarity of pg_trgm and I got:

select similarity('activity','activityns');
similarity: 0,6666667

select similarity('activities','activityns');
similarity: 0,4666667

Is there any other solutions to detect similarity between the words and gives more precise results?


Solution

  • FTS first reduce the token to lexeme and then compare, trigram compares three letters - you can't compare comparison results with so different algorithms, hereis example for FTS (showing why one is closerto another in your sample):

    t=# with w(v) as (values('activityns'),('activity'),('activities'),('activit'))
    select to_tsvector(v),v, to_tsvector(v) @@ to_tsquery('activ:*'),to_tsvector(v) @@ to_tsquery('activity'),to_tsvector(v) @@ to_tsquery('activit:*') from w;
      to_tsvector  |     v      | ?column? | ?column? | ?column?
    ---------------+------------+----------+----------+----------
     'activityn':1 | activityns | t        | f        | t
     'activ':1     | activity   | t        | t        | f
     'activ':1     | activities | t        | t        | f
     'activit':1   | activit    | t        | f        | t
    (4 rows)
    

    look to which lexeme the word is reduced (first column) and take a look at what results give you wildcard usage depending on lexeme to which the word is reduced (3,4,5) columns