sql snowflake-cloud-data-platform levenshtein-distance

How can I find the variation in strings in a single column using Snowflake SQL?

Say I have a table like this:

Person1	Person2
Dave	Fred
Dave	Dave
Dave	Mike
Fred	Dave
Dave	Mike
Dave	Jeff

In column 'Person1' clearly Dave is the most popular input, so I'd like to produce a 'similarity score' or 'variation within column' score that would reflect that in SQL (Snowflake).

In contrast, for the column 'Person2' there is more variation between the strings and so the similarity score would be lower, or variation within column higher. So you might end up with a similarity score output as something like: 'Person1': 0.9, 'Person2': 0.4.

If this is just row-wise Levenshtein Distance (LD), how can I push EDITDISTANCE across these to get a score for each column please? At the moment I can only see how to get the LD between 'Person1' and 'Person2', rather than within 'Person1' and 'Person2'.

Many thanks

Solution

You proposed values of 0.9 and 0.4 seem like ratio's of sameness, so that can be calculated with a count and ratio_of_report like so:


with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), p1 as (
    select 
      person1 
      ,count(*) as c_p1
      ,ratio_to_report(c_p1) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p1 desc) = 1
), p2 as (
    select 
      person2 
      ,count(*) as c_p2
      ,ratio_to_report(c_p2) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p2 desc) = 1
)
select 
    p1.q as p1_same,
    p2.q as p2_same
from p1
cross join p2
;

giving:

P1_SAME	P2_SAME
0.833333	0.333333

Editdistance:

So using a full cross join, we can calculate the editdistance of all values, and find the ratio of this to the total count:

with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), combo as (
select 
    editdistance(da.person1, db.person1) as p1_dist    
    ,editdistance(da.person2, db.person2) as p2_dist
from data as da
cross join data as db
)
select count(*) as c
    ,sum(p1_dist) as s_p1_dist
    ,sum(p2_dist) as s_p2_dist
    ,c / s_p1_dist as p1_same
    ,c / s_p2_dist as p2_same 
from combo
 ;

But given editdistance gives a result of zero for same and positive value for difference, the scaling of these does not align with the desired result...

JAROWINKLER_SIMILARITY:

Given the Jarowinklet similarity result is already scaled between 0 - 100, it makes more sense to be able to average this..

select 
     avg(JAROWINKLER_SIMILARITY(da.person1, db.person1)/100) as p1_dist    
    ,avg(JAROWINKLER_SIMILARITY(da.person2, db.person2)/100) as p2_dist
from data as da
cross join data as db;

P1_DIST	P2_DIST
0.861111111111	0.527777777778