Search code examples
postgresqlplpgsqlphoneticssupabase

Phonetic Algorithms for Postgresql


please, I am working on a PoC for Person Real-time Identification, and one of the critical aspects of it is to support both minor misspelling and phonetic variations of First, Middle, and Last name. Like HarinGton == HarrinBton or RaphEAl == RafAEl. It's working for longer names, but it's a bit more imprecise for names like Lee and John.

I am using Double Metaphone through dmetaphone() and dmetaphone_alt() in PostgreSQL 13.3 (Supabase.io). And although I appreciate Double Metaphone it has a (too?) short string as the outcome. metaphone() has parameters to make the resulting phonetic representation longer. I investigated dmetaphone() and couldn't find anything other than the default function.

Is there a way of making dmetaphone() and dmetaphone_alt() return a longer phonetic representation similar to metaphone()'s, but with a ALT variation?.

Any help would be much appreciated.

Thanks


Solution

  • Looking at the postgres docs for these features you don't have parametric control over the length of the encoded string for Double Metaphone. In the case of single Metaphone, you can only truncate the output string:

    max_output_length sets the maximum length of the output metaphone code; if longer, the output is truncated to this length.

    However you may get much better results by using Trigram Similarity or Levenshtein Distance on the encoded output from either of the metaphone methods - this can be a more powerful way to handle phonetic permutations using Metaphones.

    Example

    Consider all the spelling permutations possible for the artist Cyndi Lauper, using double metaphone with trigram similarity we can achieve 100% similarity between the incorrect string cindy lorper and the correct spelling:

    SELECT similarity(dmetaphone('cindy lorper'), dmetaphone('cyndi lauper'));
    

    yields: similarity real: 1 (ie: 100% similarity)

    Which means the encodings are identical for both input strings using Double Metaphone. When using Metaphone, they're slightly different. All of the following yield SNTLRPR

    SELECT metaphone('cyndy lorper',10);
    SELECT metaphone('sinday lorper', 10);
    SELECT metaphone('cinday laurper', 10);
    
    SELECT metaphone('cyndi lauper',10);
    

    yields: SNTLPR which is only one character different to SNTLRPR

    You can also use Levenshtein Distance to calculate it, which gives you a filterable parameter to work with:

    SELECT levenshtein(metaphone('sinday lorper', 10), metaphone('cyndi lauper', 10));
    

    yields: levenshtein integer: 1


    It's working for longer names, but it's a bit more imprecise for names like Lee and John.

    It's a bit difficult to see exactly what you're having trouble with - without a more complete reprex.

    SELECT similarity(dmetaphone('lee'), dmetaphone('leigh'));
    SELECT similarity(dmetaphone('jon'), dmetaphone('john'));
    

    both yield: similarity real: 1 (ie: 100% similarity)

    Edit: here's a easy to follow guide for fuzzy matching with postgres