Search code examples
postgresqlsimilarity

A good way to search for similar words in PostgreSQL


I am using PostgreSQL 10 and, even though I have some experience writing queries, I don't now that much about DBMSs. I want that my query be more "intelligent" and not just match exact words. I made a humble research, and it seems that to accomplish this we have many features. First, I thought about using RegEx on the query, but it seems not a good idea, because RegEx are fairly limited. I am sure that Postgres has some better feature more suitable for this.

To implement my smart search I came up with an idea like this: The user can forget a letter in each word or have repeated letters that will not cause a mismatch. Also letters like ç and c should match. This idea is not the perfect reproduction of the problem that I have to solve I am only exposing my problem.

For example:

If a user searches for "macs vermelhas". I should return results containing "macas vermelhas", "maças vermelhas", "macs vermelhas", "macs vrmelhas".

The system is in Portuguese and I have pretty much the same search challenges of Spanish language (ê, ü, ã and etc).


Solution

  • I think that trigram indexes are a perfect fit for you.

    CREATE EXTENSION pg_trgm;
    

    Let's have a table like that:

    CREATE TABLE words (word text);
    
    INSERT INTO words VALUES ('maças vermelhas');
    INSERT INTO words VALUES ('alguna cosa');
    
    CREATE INDEX ON words USING gin (word gin_trgm_ops);
    

    Now you can find words by similarity:

    SELECT * FROM words WHERE word % 'macs vermelhas';
    
          word       
    -----------------
     maças vermelhas
    (1 row)
    

    The index can be used (I force it in this case):

    SET enable_seqscan=off;
    
    EXPLAIN (COSTS OFF)
       SELECT * FROM words WHERE word % 'macs vermelhas';
    
                         QUERY PLAN                      
    -----------------------------------------------------
     Bitmap Heap Scan on words
       Recheck Cond: (word % 'macs vermelhas'::text)
       ->  Bitmap Index Scan on words_word_idx
             Index Cond: (word % 'macs vermelhas'::text)
    (4 rows)