sql oracle-database search duplicates search-engine

Oracle Find duplicate records that are similar but aren't exact matches

I'm trying to find a way to search a text field to identify rows that could be similar so I can identify if they are duplicates that should be merged. For example if my data looks like this:

MyText_Column

Bob
Bobby
Robert
Jane
Janey
Janie
Joe
John
Johnathan

A GROUP BY won't work because none of the values are exactly the same, but if I could have a query that would return a list of the likelihood that one row is similar would be great. Maybe there's a better layout but what I am imagining is a result like this:

Query Result

Search Compare Likely_Match

Bob Bobby 96%
Bob Robert 12%
Bob Jane 0%
Bob Janey 0%
.....
Jane Janey 87%
Jane Janie 69%
Jane Joe 12%

Then with result like that I could sort by likelihood and and visually scan to determine if the results are duplicates or not.

Solution

The UTL_MATCH package has a couple of methods to do that-- my guess is that you would want to use the Jaro-Winkler similarity algorithm. Something like

SELECT a.mytext_column search,
       b.mytext_column compare,
       utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) similarity
  FROM table_name a,
       table_name b
 WHERE a.<<primary key>> != b.<<primary key>>
 ORDER BY utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) desc

That will generate a result set of N * (N-1) rows which may be unwieldy depending on the number of rows in the original data set. You may want to restrict things by only returning the best matches for a particular search term or only returning the rows that have a similarity score greater than some threshold.