I have a table with approximately 7 million records. The table has a first_name and last_name column which I want to search on using the levenshtein() distance function.
select levenshtein('JOHN', first_name) as fn_distance,
levenshtein('DOE', last_name) as ln_distance,
id,
first_name as "firstName",
last_name as "lastName"
from person
where first_name is not null
and last_name is not null
and levenshtein('JOHN', first_name) <= 2
and levenshtein('DOE', last_name) <= 2
order by 1, 2
limit 50;
The above search is slow (4-5 seconds). What can I do to improve performance? Should I create indexes on the two columns, or something else?
After I added the indexes below:
create index first_name_idx on person using gin (first_name gin_trgm_ops);
create index last_name_idx on person using gin (last_name gin_trgm_ops);
The query now takes ~11 secs. :(
New query:
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and similarity('JOHN', first_name) >= 0.2
and similarity('DOW', last_name) >= 0.2
order by 1 desc, 2 desc, npi
limit 50;
There is no built-in index type that supports levenshtein distances, and I'm not aware of any third-party index implementations that do so either.
Another string similarity measure, trigram similarity, does have an index method to support it (the pg_trgm GIN operator class you already created your indexes with). Maybe you can switch to using that measure instead.
For the index to be used, you need to write the query with the % operator, not the similarity() function: the planner can only match the operator against the index, while a similarity() call in the WHERE clause forces a full scan. The % operator compares against the pg_trgm.similarity_threshold setting, so set that first. It would look something like this:
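For context, trigram similarity measures the fraction of three-character sequences two strings share, after lowercasing and padding. A quick sketch of how to inspect it, assuming the pg_trgm extension is installed:

```sql
-- Enable the extension (one-time, needs appropriate privileges)
create extension if not exists pg_trgm;

-- Show the trigrams pg_trgm extracts from a string
select show_trgm('JOHN');

-- Similarity is a value between 0 (no shared trigrams) and 1 (identical)
select similarity('JOHN', 'JON'),
       similarity('DOE', 'DOW');
```

Unlike levenshtein(), this measure is what the gin_trgm_ops operator class can index.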
set pg_trgm.similarity_threshold TO 0.2;
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and 'JOHN' % first_name
and 'DOW' % last_name
order by 1 desc, 2 desc, npi
limit 50;
But note that 0.2 is a very low cutoff, and the lower the cutoff, the less efficient the index becomes.
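To confirm the rewritten query actually benefits from the trigram indexes, check the plan with EXPLAIN. A sketch, using the index names from your question (the exact plan shape will vary):

```sql
set pg_trgm.similarity_threshold to 0.2;

explain (analyze, buffers)
select first_name, last_name
from person
where 'JOHN' % first_name
  and 'DOW' % last_name;
-- You want to see "Bitmap Index Scan" nodes on first_name_idx / last_name_idx;
-- a "Seq Scan on person" means the operator form is not matching the index.
```

Raising the threshold (e.g. to 0.4 or higher) shrinks the candidate set the index returns, which usually speeds the query up further at the cost of recall.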