Say I have a table like this:
Person1 | Person2 |
---|---|
Dave | Fred |
Dave | Dave |
Dave | Mike |
Fred | Dave |
Dave | Mike |
Dave | Jeff |
In column 'Person1' clearly Dave is the most popular input, so I'd like to produce a 'similarity score' or 'variation within column' score that would reflect that in SQL (Snowflake).
In contrast, for the column 'Person2' there is more variation between the strings and so the similarity score would be lower, or variation within column higher. So you might end up with a similarity score output as something like: 'Person1': 0.9, 'Person2': 0.4.
If this is just row-wise Levenshtein Distance (LD), how can I push EDITDISTANCE across these to get a score for each column please? At the moment I can only see how to get the LD between 'Person1' and 'Person2', rather than within 'Person1' and 'Person2'.
Many thanks
You proposed values of 0.9 and 0.4 seem like ratio's of sameness, so that can be calculated with a count and ratio_of_report like so:
with data(person1, person2) as (
select * from values
('Dave','Fred'),
('Dave','Dave'),
('Dave','Mike'),
('Fred','Dave'),
('Dave','Mike'),
('Dave','Jeff')
), p1 as (
select
person1
,count(*) as c_p1
,ratio_to_report(c_p1) over () as q
from data
group by 1
qualify row_number() over(order by c_p1 desc) = 1
), p2 as (
select
person2
,count(*) as c_p2
,ratio_to_report(c_p2) over () as q
from data
group by 1
qualify row_number() over(order by c_p2 desc) = 1
)
select
p1.q as p1_same,
p2.q as p2_same
from p1
cross join p2
;
giving:
P1_SAME | P2_SAME |
---|---|
0.833333 | 0.333333 |
So using a full cross join, we can calculate the editdistance of all values, and find the ratio of this to the total count:
with data(person1, person2) as (
select * from values
('Dave','Fred'),
('Dave','Dave'),
('Dave','Mike'),
('Fred','Dave'),
('Dave','Mike'),
('Dave','Jeff')
), combo as (
select
editdistance(da.person1, db.person1) as p1_dist
,editdistance(da.person2, db.person2) as p2_dist
from data as da
cross join data as db
)
select count(*) as c
,sum(p1_dist) as s_p1_dist
,sum(p2_dist) as s_p2_dist
,c / s_p1_dist as p1_same
,c / s_p2_dist as p2_same
from combo
;
But given editdistance gives a result of zero for same and positive value for difference, the scaling of these does not align with the desired result...
Given the Jarowinklet similarity result is already scaled between 0 - 100, it makes more sense to be able to average this..
select
avg(JAROWINKLER_SIMILARITY(da.person1, db.person1)/100) as p1_dist
,avg(JAROWINKLER_SIMILARITY(da.person2, db.person2)/100) as p2_dist
from data as da
cross join data as db;
P1_DIST | P2_DIST |
---|---|
0.861111111111 | 0.527777777778 |