Search code examples
sqloracle-databasesearchduplicatessearch-engine

Oracle Find duplicate records that are similar but aren't exact matches


I'm trying to find a way to search a text field to identify rows that could be similar so I can identify if they are duplicates that should be merged. For example if my data looks like this:

MyText_Column

  1. Bob
  2. Bobby
  3. Robert
  4. Jane
  5. Janey
  6. Janie
  7. Joe
  8. John
  9. Johnathan

A GROUP BY won't work because none of the values are exactly the same, but if I could have a query that would return a list of the likelihood that one row is similar would be great. Maybe there's a better layout but what I am imagining is a result like this:

Query Result

Search Compare Likely_Match

  1. Bob Bobby 96%
  2. Bob Robert 12%
  3. Bob Jane 0%
  4. Bob Janey 0%
  5. .....
  6. Jane Janey 87%
  7. Jane Janie 69%
  8. Jane Joe 12%

Then with result like that I could sort by likelihood and and visually scan to determine if the results are duplicates or not.


Solution

  • The UTL_MATCH package has a couple of methods to do that-- my guess is that you would want to use the Jaro-Winkler similarity algorithm. Something like

    SELECT a.mytext_column search,
           b.mytext_column compare,
           utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) similarity
      FROM table_name a,
           table_name b
     WHERE a.<<primary key>> != b.<<primary key>>
     ORDER BY utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) desc
    

    That will generate a result set of N * (N-1) rows which may be unwieldy depending on the number of rows in the original data set. You may want to restrict things by only returning the best matches for a particular search term or only returning the rows that have a similarity score greater than some threshold.