Search code examples
postgresqlpostgresql-10postgresql-performance

Optimize PostgreSQL query with levenshtein() function


I have a table with approximately 7 million records. The table has a first_name and last_name column which I want to search on using the levenshtein() distance function.

select levenshtein('JOHN', first_name) as fn_distance,
       levenshtein('DOE', last_name) as ln_distance,
       id,
       first_name as "firstName",
       last_name as "lastName"
  from person
 where first_name is not null
   and last_name is not null
   and levenshtein('JOHN', first_name) <= 2
   and levenshtein('DOE', last_name) <= 2
 order by 1, 2
 limit 50;

The above search is slow (4 - 5 secs), what can I do to improve performance? Should a create indexes on the two columns, or something else?

After I added indexes below:

create index first_name_idx on person using gin (first_name gin_trgm_ops);

create index last_name_idx on person using gin(last_name gin_trgm_ops);

The query now takes ~11 secs. :(

New query:

select similarity('JOHN', first_name) as fnsimilarity,
       similarity('DOW', last_name) as lnsimilarity,
       first_name as "firstName",
       last_name as "lastName",
       npi
  from person
 where first_name is not null
   and last_name is not null
   and similarity('JOHN', first_name) >= 0.2
   and similarity('DOW', last_name) >= 0.2
 order by 1 desc, 2 desc, npi
 limit 50;

Solution

  • There is no built-in index type that supports levenshtein distances. I'm not aware of any 3rd party index implementations to do so either.

    Another string similarity measure, trigram similarity, does have an index method to support it. Maybe you can switch to using that measure instead.

    You need to write the query using the % operator, not the similarity function. So it would look something like this:

    set pg_trgm.similarity_threshold TO 0.2;
    select similarity('JOHN', first_name) as fnsimilarity,
           similarity('DOW', last_name) as lnsimilarity,
           first_name as "firstName",
           last_name as "lastName",
           npi
      from person
     where first_name is not null
       and last_name is not null
       and 'JOHN' % first_name
       and 'DOW' % last_name
     order by 1, 2, npi
     limit 50;
    

    But note that 0.2 is very low cutoff, and the lower the cutoff the less efficient the index.