Search code examples
sqlsnowflake-cloud-data-platformlevenshtein-distance

How can I find the variation in strings in a single column using Snowflake SQL?


Say I have a table like this:

Person1 Person2
Dave Fred
Dave Dave
Dave Mike
Fred Dave
Dave Mike
Dave Jeff

In column 'Person1' clearly Dave is the most popular input, so I'd like to produce a 'similarity score' or 'variation within column' score that would reflect that in SQL (Snowflake).

In contrast, for the column 'Person2' there is more variation between the strings and so the similarity score would be lower, or variation within column higher. So you might end up with a similarity score output as something like: 'Person1': 0.9, 'Person2': 0.4.

If this is just row-wise Levenshtein Distance (LD), how can I push EDITDISTANCE across these to get a score for each column please? At the moment I can only see how to get the LD between 'Person1' and 'Person2', rather than within 'Person1' and 'Person2'.

Many thanks


Solution

  • You proposed values of 0.9 and 0.4 seem like ratio's of sameness, so that can be calculated with a count and ratio_of_report like so:

    
    with data(person1, person2) as (
        select * from values
            ('Dave','Fred'),
            ('Dave','Dave'),
            ('Dave','Mike'),
            ('Fred','Dave'),
            ('Dave','Mike'),
            ('Dave','Jeff')
    ), p1 as (
        select 
          person1 
          ,count(*) as c_p1
          ,ratio_to_report(c_p1) over () as q
        from data
        group by 1
        qualify row_number() over(order by c_p1 desc) = 1
    ), p2 as (
        select 
          person2 
          ,count(*) as c_p2
          ,ratio_to_report(c_p2) over () as q
        from data
        group by 1
        qualify row_number() over(order by c_p2 desc) = 1
    )
    select 
        p1.q as p1_same,
        p2.q as p2_same
    from p1
    cross join p2
    ;
    

    giving:

    P1_SAME P2_SAME
    0.833333 0.333333

    Editdistance:

    So using a full cross join, we can calculate the editdistance of all values, and find the ratio of this to the total count:

    with data(person1, person2) as (
        select * from values
            ('Dave','Fred'),
            ('Dave','Dave'),
            ('Dave','Mike'),
            ('Fred','Dave'),
            ('Dave','Mike'),
            ('Dave','Jeff')
    ), combo as (
    select 
        editdistance(da.person1, db.person1) as p1_dist    
        ,editdistance(da.person2, db.person2) as p2_dist
    from data as da
    cross join data as db
    )
    select count(*) as c
        ,sum(p1_dist) as s_p1_dist
        ,sum(p2_dist) as s_p2_dist
        ,c / s_p1_dist as p1_same
        ,c / s_p2_dist as p2_same 
    from combo
     ;
    

    But given editdistance gives a result of zero for same and positive value for difference, the scaling of these does not align with the desired result...

    JAROWINKLER_SIMILARITY:

    Given the Jarowinklet similarity result is already scaled between 0 - 100, it makes more sense to be able to average this..

    select 
         avg(JAROWINKLER_SIMILARITY(da.person1, db.person1)/100) as p1_dist    
        ,avg(JAROWINKLER_SIMILARITY(da.person2, db.person2)/100) as p2_dist
    from data as da
    cross join data as db;
    
    P1_DIST P2_DIST
    0.861111111111 0.527777777778