Search code examples
databasepostgresqlsimilarity

"similarity()" query score is pretty low for search text almost matching


select similarity('GTudH', 'tud'),
       similarity('GTudH', 'gtu'),
       similarity('GTudH', 'gdh')

Above is the query result from Postgres. Can someone explain why the first similarity score is only 0.1, but the third one is 0.25?

Tested the same with snowflake database, which seems like alright:

JAROWINKLER_SIMILARITY('GTUDH','TUD') JAROWINKLER_SIMILARITY('GTUDH','GTU') JAROWINKLER_SIMILARITY('GTUDH','GDH')
86 90 51

Solution

  • That's because the functions are different. PostgreSQL's similarity() function is part of the pg_trgm extension and calculates similarity on the basis of “trigrams” – sequences of three letters that occur in the word. Although the actual algorithm is slightly different (see the source), two words are considered similar if they share many trigrams.

    Let's look at the trigrams of GTudH:

    SELECT show_trgm('GTudH');
    
                show_trgm            
    ═════════════════════════════════
     {"  g"," gt","dh ",gtu,tud,udh}
    (1 row)
    

    pg_trgm prepends two spaces and appends one, for the reason that similarity at the beginning of the word is counted as more significant.

    So GTudH shares one trigram with tud, three trigrams with gtu and two trigrams with gdh, which explains the different results.

    “Similarity” is not a clear-cut concept, and there are many different ways to define it.