I'm trying to find a way to search a text field to identify rows that could be similar so I can identify if they are duplicates that should be merged. For example if my data looks like this:
MyText_Column
A GROUP BY won't work because none of the values are exactly the same, but if I could have a query that would return a list of the likelihood that one row is similar would be great. Maybe there's a better layout but what I am imagining is a result like this:
Query Result
Search Compare Likely_Match
Then with result like that I could sort by likelihood and and visually scan to determine if the results are duplicates or not.
The UTL_MATCH package has a couple of methods to do that-- my guess is that you would want to use the Jaro-Winkler similarity algorithm. Something like
SELECT a.mytext_column search,
b.mytext_column compare,
utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) similarity
FROM table_name a,
table_name b
WHERE a.<<primary key>> != b.<<primary key>>
ORDER BY utl_match.jaro_winkler_similarity( a.mytext_column, b.mytext_column ) desc
That will generate a result set of N * (N-1) rows which may be unwieldy depending on the number of rows in the original data set. You may want to restrict things by only returning the best matches for a particular search term or only returning the rows that have a similarity score greater than some threshold.