Search code examples
postgresqlindexingsqlalchemyfull-text-search

Fast string search in PostgreSQL table with slightly erroneous input


Assume I have a user table in PostgreSQL with columns

first_name (PK)
last_name (PK)
email

There are now millions of users in it. One user has the record

(John, Smith, john.smith@gmail.com)

Now I search for him and enter erroneously Johny Smit.

How can I still find the record and this really quick? Is that possible with sqlalchemy as well?


Solution

  • You can use trigram-based indexing and search included in the pg_trgm extension: demo

    create extension pg_trgm;
    create index trgm_idx on my_table using GiST ( first_name gist_trgm_ops
                                                  ,last_name  gist_trgm_ops);
    
    select * from my_table 
    where    first_name % 'Johny' 
      and    last_name  % 'Smit' 
    order by last_name <->'Smit'
            ,first_name<->'Johny'
    limit 5;
    
    first_name last_name email
    John Smith john.smith@gmail.com
    QUERY PLAN
    Limit (cost=0.28..8.31 rows=1 width=119) (actual time=19.341..19.353 rows=1 loops=1)
      Output: first_name, last_name, email, ((first_name <-> 'Johny'::text)), ((last_name <-> 'Smit'::text))
      -> Index Scan using trgm_idx on public.my_table (cost=0.28..8.31 rows=1 width=119) (actual time=19.337..19.348 rows=1 loops=1)
            Output: first_name, last_name, email, (first_name <-> 'Johny'::text), (last_name <-> 'Smit'::text)
            Index Cond: ((my_table.first_name % 'Johny'::text) AND (my_table.last_name % 'Smit'::text))
            Order By: ((my_table.first_name <-> 'Johny'::text) AND (my_table.last_name <-> 'Smit'::text))
    Planning Time: 1.364 ms
    Execution Time: 19.416 ms
    1. In a real-life scenario it won't be as fast as this because I just buried John Smith under a pile of 100k random uuids.
    2. You will have to tweak your similarity target and still probably search for a number of best matches and somehow pick out of those - there could be many people with names that also match your search.
    3. You might want use an expression index that merges the two fields so that you can deal with just one search phrase and one resulting similarity metric, instead of having to resolve two.
    drop index trgm_idx ;
    create index trgm_idx2 on my_table 
       using GiST ((first_name||' '||last_name) gist_trgm_ops);
    
    prepare find_john_smith_using_pg_trgm(text) as 
    select *,$1 as search_phrase
            , (first_name||' '||last_name)<->  $1 as "<->"
            , (first_name||' '||last_name)<<-> $1 as "<<->"
            , (first_name||' '||last_name)<<<->$1 as "<<<->"
            , word_similarity(first_name||' '||last_name,$1)
            , strict_word_similarity(first_name||' '||last_name,$1)
    from my_table 
    where    (first_name||' '||last_name) % $1
    order by (first_name||' '||last_name)<->$1
    limit 5;
    
    execute find_john_smith_using_pg_trgm('Johny Smit');
    
    first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
    John Smith john.smith@gmail.com Johny Smit 0.4285714 0.38461536 0.4285714 0.61538464 0.5714286
    Johan Smittson johan.smittson@gmail.com Johny Smit 0.6315789 0.6111111 0.6315789 0.3888889 0.36842105
    execute find_john_smith_using_pg_trgm('Johny');
    
    first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
    John Smith john.smith@gmail.com Johny 0.6923077 0.6363636 0.6923077 0.36363637 0.30769232
    execute find_john_smith_using_pg_trgm('Smitt');
    
    first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
    Johan Smittson johan.smittson@gmail.com Smitt 0.6875 0.6666666 0.6875 0.33333334 0.3125
    John Smith john.smith@gmail.com Smitt 0.6923077 0.6363636 0.6923077 0.36363637 0.30769232
    1. If you do keep them separate, you can establish priorities, and for example order by a weighted average of the two - it's safe to assume the match on last_name has some priority over the match on first_name, especially since last names aren't altered as often (diminutives, abbreviation, nicknames, second names, etc.).