Assume I have a user
table in PostgreSQL with columns
first_name (PK)
last_name (PK)
email
There are now millions of users in it. One user has the record
(John, Smith, john.smith@gmail.com)
Now I search for him and enter erroneously Johny Smit
.
How can I still find the record and this really quick? Is that possible with sqlalchemy as well?
You can use trigram-based indexing and search included in the pg_trgm
extension: demo
create extension pg_trgm;
create index trgm_idx on my_table using GiST ( first_name gist_trgm_ops
,last_name gist_trgm_ops);
select * from my_table
where first_name % 'Johny'
and last_name % 'Smit'
order by last_name <->'Smit'
,first_name<->'Johny'
limit 5;
first_name | last_name | |
---|---|---|
John | Smith | john.smith@gmail.com |
QUERY PLAN |
---|
Limit (cost=0.28..8.31 rows=1 width=119) (actual time=19.341..19.353 rows=1 loops=1) |
Output: first_name, last_name, email, ((first_name <-> 'Johny'::text)), ((last_name <-> 'Smit'::text)) |
-> Index Scan using trgm_idx on public.my_table (cost=0.28..8.31 rows=1 width=119) (actual time=19.337..19.348 rows=1 loops=1) |
Output: first_name, last_name, email, (first_name <-> 'Johny'::text), (last_name <-> 'Smit'::text) |
Index Cond: ((my_table.first_name % 'Johny'::text) AND (my_table.last_name % 'Smit'::text)) |
Order By: ((my_table.first_name <-> 'Johny'::text) AND (my_table.last_name <-> 'Smit'::text)) |
Planning Time: 1.364 ms |
Execution Time: 19.416 ms |
John Smith
under a pile of 100k random uuid
s.drop index trgm_idx ;
create index trgm_idx2 on my_table
using GiST ((first_name||' '||last_name) gist_trgm_ops);
prepare find_john_smith_using_pg_trgm(text) as
select *,$1 as search_phrase
, (first_name||' '||last_name)<-> $1 as "<->"
, (first_name||' '||last_name)<<-> $1 as "<<->"
, (first_name||' '||last_name)<<<->$1 as "<<<->"
, word_similarity(first_name||' '||last_name,$1)
, strict_word_similarity(first_name||' '||last_name,$1)
from my_table
where (first_name||' '||last_name) % $1
order by (first_name||' '||last_name)<->$1
limit 5;
execute find_john_smith_using_pg_trgm('Johny Smit');
first_name | last_name | search_phrase | <-> | <<-> | <<<-> | word_similarity | strict_word_similarity | |
---|---|---|---|---|---|---|---|---|
John | Smith | john.smith@gmail.com | Johny Smit | 0.4285714 | 0.38461536 | 0.4285714 | 0.61538464 | 0.5714286 |
Johan | Smittson | johan.smittson@gmail.com | Johny Smit | 0.6315789 | 0.6111111 | 0.6315789 | 0.3888889 | 0.36842105 |
execute find_john_smith_using_pg_trgm('Johny');
first_name | last_name | search_phrase | <-> | <<-> | <<<-> | word_similarity | strict_word_similarity | |
---|---|---|---|---|---|---|---|---|
John | Smith | john.smith@gmail.com | Johny | 0.6923077 | 0.6363636 | 0.6923077 | 0.36363637 | 0.30769232 |
execute find_john_smith_using_pg_trgm('Smitt');
first_name | last_name | search_phrase | <-> | <<-> | <<<-> | word_similarity | strict_word_similarity | |
---|---|---|---|---|---|---|---|---|
Johan | Smittson | johan.smittson@gmail.com | Smitt | 0.6875 | 0.6666666 | 0.6875 | 0.33333334 | 0.3125 |
John | Smith | john.smith@gmail.com | Smitt | 0.6923077 | 0.6363636 | 0.6923077 | 0.36363637 | 0.30769232 |
last_name
has some priority over the match on first_name
, especially since last names aren't altered as often (diminutives, abbreviation, nicknames, second names, etc.).