postgresql indexing sqlalchemy full-text-search

Fast string search in PostgreSQL table with slightly erroneous input

Assume I have a user table in PostgreSQL with columns

first_name (PK)
last_name (PK)
email

There are now millions of users in it. One user has the record

(John, Smith, john.smith@gmail.com)

Now I search for him and enter erroneously Johny Smit.

How can I still find the record and this really quick? Is that possible with sqlalchemy as well?

Solution

You can use trigram-based indexing and search included in the pg_trgm extension: demo

create extension pg_trgm;
create index trgm_idx on my_table using GiST ( first_name gist_trgm_ops
                                              ,last_name  gist_trgm_ops);

select * from my_table 
where    first_name % 'Johny' 
  and    last_name  % 'Smit' 
order by last_name <->'Smit'
        ,first_name<->'Johny'
limit 5;

first_name	last_name	email
John	Smith	john.smith@gmail.com

QUERY PLAN
Limit (cost=0.28..8.31 rows=1 width=119) (actual time=19.341..19.353 rows=1 loops=1)
Output: first_name, last_name, email, ((first_name <-> 'Johny'::text)), ((last_name <-> 'Smit'::text))
-> Index Scan using trgm_idx on public.my_table (cost=0.28..8.31 rows=1 width=119) (actual time=19.337..19.348 rows=1 loops=1)
Output: first_name, last_name, email, (first_name <-> 'Johny'::text), (last_name <-> 'Smit'::text)
Index Cond: ((my_table.first_name % 'Johny'::text) AND (my_table.last_name % 'Smit'::text))
Order By: ((my_table.first_name <-> 'Johny'::text) AND (my_table.last_name <-> 'Smit'::text))
Planning Time: 1.364 ms
Execution Time: 19.416 ms

In a real-life scenario it won't be as fast as this because I just buried John Smith under a pile of 100k random uuids.
You will have to tweak your similarity target and still probably search for a number of best matches and somehow pick out of those - there could be many people with names that also match your search.
You might want use an expression index that merges the two fields so that you can deal with just one search phrase and one resulting similarity metric, instead of having to resolve two.

drop index trgm_idx ;
create index trgm_idx2 on my_table 
   using GiST ((first_name||' '||last_name) gist_trgm_ops);

prepare find_john_smith_using_pg_trgm(text) as 
select *,$1 as search_phrase
        , (first_name||' '||last_name)<->  $1 as "<->"
        , (first_name||' '||last_name)<<-> $1 as "<<->"
        , (first_name||' '||last_name)<<<->$1 as "<<<->"
        , word_similarity(first_name||' '||last_name,$1)
        , strict_word_similarity(first_name||' '||last_name,$1)
from my_table 
where    (first_name||' '||last_name) % $1
order by (first_name||' '||last_name)<->$1
limit 5;

execute find_john_smith_using_pg_trgm('Johny Smit');

first_name	last_name	email	search_phrase	<->	<<->	<<<->	word_similarity	strict_word_similarity
John	Smith	john.smith@gmail.com	Johny Smit	0.4285714	0.38461536	0.4285714	0.61538464	0.5714286
Johan	Smittson	johan.smittson@gmail.com	Johny Smit	0.6315789	0.6111111	0.6315789	0.3888889	0.36842105

execute find_john_smith_using_pg_trgm('Johny');

first_name	last_name	email	search_phrase	<->	<<->	<<<->	word_similarity	strict_word_similarity
John	Smith	john.smith@gmail.com	Johny	0.6923077	0.6363636	0.6923077	0.36363637	0.30769232

execute find_john_smith_using_pg_trgm('Smitt');

first_name	last_name	email	search_phrase	<->	<<->	<<<->	word_similarity	strict_word_similarity
Johan	Smittson	johan.smittson@gmail.com	Smitt	0.6875	0.6666666	0.6875	0.33333334	0.3125
John	Smith	john.smith@gmail.com	Smitt	0.6923077	0.6363636	0.6923077	0.36363637	0.30769232

If you do keep them separate, you can establish priorities, and for example order by a weighted average of the two - it's safe to assume the match on last_name has some priority over the match on first_name, especially since last names aren't altered as often (diminutives, abbreviation, nicknames, second names, etc.).