I have the following FNAMES
table (it contains about 58k records)
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
I would like to split the table by all the records that contain the same first lette, like this:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 4 | Avi |
| 2 | Dudi |
| 5 | DAVE |
| 3 | Moshiko |
....
For each split I would like to find the record with minimal Jaro–Winkler distance
. It means for every letters that starts with 'a' I'll find the most similar record.
What do I have to change in the following code?
select FNAMES.* , MIN(Jaro–Winkler(FNAMES.NICKNAMES, FNAMES.NICKNAMES))
from FNAMES
LEFT OUTER JOIN FNAMES
ON(true)
WHERE Jaro–Winkler (FNAMES.NICKNAMES, FNAMES.NICKNAMES) <= 4
GROUP BY FNAMES.NICKNAMES
Something like this
select f1.nicknames
,f2.nicknames
from (select f1.nicknames
,f2.nicknames
,rank () over
(
partition by f1.nicknames
order by jaro–winkler(f1.nicknames,f2.nicknames) desc
) as rnk
from fnames f1
left join fnames f2
on substr(f1.nicknames,1,1) =
substr(f2.nicknames,1,1)
where f1.nicknames < f2.nicknames
) t
where rnk = 1