select similarity('GTudH', 'tud'),
similarity('GTudH', 'gtu'),
similarity('GTudH', 'gdh')
Above is the query result from Postgres. Can someone explain why the first similarity score is only 0.1, but the third one is 0.25?
Tested the same with snowflake database, which seems like alright:
JAROWINKLER_SIMILARITY('GTUDH','TUD') | JAROWINKLER_SIMILARITY('GTUDH','GTU') | JAROWINKLER_SIMILARITY('GTUDH','GDH') |
---|---|---|
86 | 90 | 51 |
That's because the functions are different. PostgreSQL's similarity()
function is part of the pg_trgm
extension and calculates similarity on the basis of “trigrams” – sequences of three letters that occur in the word. Although the actual algorithm is slightly different (see the source), two words are considered similar if they share many trigrams.
Let's look at the trigrams of GTudH
:
SELECT show_trgm('GTudH');
show_trgm
═════════════════════════════════
{" g"," gt","dh ",gtu,tud,udh}
(1 row)
pg_trgm
prepends two spaces and appends one, for the reason that similarity at the beginning of the word is counted as more significant.
So GTudH
shares one trigram with tud
, three trigrams with gtu
and two trigrams with gdh
, which explains the different results.
“Similarity” is not a clear-cut concept, and there are many different ways to define it.