Search code examples
joinhivedistanceimpala

Splitting table using alphabet order in impala/ hive


I have the following FNAMES table (it contains about 58k records)

+------+-------------+
| ID   |   NICKNAMES |
+------+-------------+
|  1   |  Avile      |
|  2   |  Dudi       |
|  3   |  Moshiko    |
|  4   |  Avi        |
|  5   |  DAVE       |
....

I would like to split the table by all the records that contain the same first lette, like this:

+------+-------------+
| ID   |   NICKNAMES |
+------+-------------+
|  1   |  Avile      |
|  4   |  Avi        |

|  2   |  Dudi       |
|  5   |  DAVE       |

|  3   |  Moshiko    |
....

For each split I would like to find the record with minimal Jaro–Winkler distance. It means for every letters that starts with 'a' I'll find the most similar record. What do I have to change in the following code?

select FNAMES.*  ,  MIN(Jaro–Winkler(FNAMES.NICKNAMES, FNAMES.NICKNAMES))
from FNAMES
LEFT OUTER JOIN FNAMES
ON(true)
  WHERE Jaro–Winkler (FNAMES.NICKNAMES, FNAMES.NICKNAMES) <= 4
GROUP BY FNAMES.NICKNAMES

Solution

  • Something like this

    select      f1.nicknames
               ,f2.nicknames
    
    from       (select      f1.nicknames
                           ,f2.nicknames
                           ,rank () over
                            (
                                partition by    f1.nicknames
                                order by        jaro–winkler(f1.nicknames,f2.nicknames) desc
                            ) as rnk
    
                from                    fnames f1 
    
                            left join   fnames f2
    
                            on          substr(f1.nicknames,1,1) = 
                                        substr(f2.nicknames,1,1)
    
                where       f1.nicknames < f2.nicknames
                ) t
    
    where       rnk = 1