Search code examples
postgresqlsimilaritywordstrigram

Postgres word_similarity not comparing words


"Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string)."

That's the definition of word_similarity(a,b), as I understand it, it will look for the WORD a inside the text b, splitting b by words and getting the score of the highest match word.

However, I'm seeing some inconsistencies where the word matching is not really by word, looks like all trigrams are scrambled and compared?

Example:

select word_similarity('sage', 'message sag')

Returns 1, clearly neither 'message' or 'sag' should match with 'sage', but if we combine the possible trigrams from 'message sag', we would then find that all the trigrams from 'sage' would match, but that's not really what should happen since the function description talks about word by word... Is it because both words are next to each other?

The following, will return a 0.6 score:

select word_similarity('sage', 'message test sag') 

Edit: Fiddle to play around http://sqlfiddle.com/#!17/b4bab/1


Solution

  • Function inconsistent with description

    Related subject on pgsql-bugs mailing list.

    The substring similarity algorithm described by the author compares trigram arrays of a query string and a text. The problem is that a trigram array is optimised (duplicated trigrams are eliminated) and loses information about individual words of the text.

    The query illustrates the issue:

    with data(t) as (
    values
        ('message'),
        ('message s'),
        ('message sag'),
        ('message sag sag'),
        ('message sag sage')
    )
    
    select 
        t as "text", 
        show_trgm(t) as "text trigrams", 
        show_trgm('sage') as "string trigrams", 
        cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
    from data;
    
           text       |                       text trigrams                       |       string trigrams       | common trgms 
    ------------------+-----------------------------------------------------------+-----------------------------+--------------
     message          | {"  m"," me",age,ess,"ge ",mes,sag,ssa}                   | {"  s"," sa",age,"ge ",sag} |            3
     message s        | {"  m","  s"," me"," s ",age,ess,"ge ",mes,sag,ssa}       | {"  s"," sa",age,"ge ",sag} |            4
     message sag      | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
     message sag sag  | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
     message sag sage | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
    (5 rows)    
    

    The trigram arrays in last three rows are the same and contain all trigrams of the query string.

    Obviously, the implementation is not consistent with the description of the function (the description was changed in later releases of the documentation):

    Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.


    My function used in the above query:

    create or replace function public.array_intersect(anyarray, anyarray)
    returns anyarray language sql immutable
    as $$
        select case 
            when $1 is null then $2
            else
                array(
                    select unnest($1)
                    intersect
                    select unnest($2)
                )
            end;
    $$;
    

    Workaround

    You can easily write your own function to get more expected results:

    create or replace function my_word_similarity(text, text)
    returns real language sql immutable as $$
        select max(similarity($1, word))
        from regexp_split_to_table($2, '[^[:alnum:]]') word
    $$;
    

    Compare:

    with data(t) as (
    values
        ('message'),
        ('message s'),
        ('message sag'),
        ('message sag sag'),
        ('message sag sage')
    )
    
    select t, word_similarity('sage', t), my_word_similarity('sage', t)
    from data;
    
            t         | word_similarity | my_word_similarity
    ------------------+-----------------+--------------------
     message          |             0.6 |                0.3
     message s        |             0.8 |                0.3
     message sag      |               1 |                0.5
     message sag sag  |               1 |                0.5
     message sag sage |               1 |                  1
    (5 rows)
    

    New function in Postgres 11+

    There is a new function in Postgres 11+ strict_word_similarity() which gives results expected by the author of the question:

    with data(t) as (
    values
        ('message'),
        ('message s'),
        ('message sag'),
        ('message sag sag'),
        ('message sag sage')
    )
    
    select t, word_similarity('sage', t), strict_word_similarity('sage', t)
    from data;
    
            t         | word_similarity | strict_word_similarity
    ------------------+-----------------+------------------------
     message          |             0.6 |                    0.3
     message s        |             0.8 |             0.36363637
     message sag      |               1 |                    0.5
     message sag sag  |               1 |                    0.5
     message sag sage |               1 |                      1
    (5 rows)