Search code examples
excelvbafuzzy-search

List of items find almost duplicates


Within excel I have a list of artists, songs, edition. This list contains over 15000 records. The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typo's and I'd like to fix this up and remove those records.

So for example some records:

ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official

Each dash indicates a separate column (so 3 columns A, B, C are filled in)

How would I mark them as duplicates within Excel?

I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.

Any regex magic or vba script what can help me out? It'd also be alright to see how much similar the row is (say 80% similar).


Solution

  • One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:

    https://stackoverflow.com/a/4243652/1278553

    From there, you can use the function directly in your spreadsheet to find similarities between instances:

    enter image description here

    You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:

    select
      s1.group, s2.group, s1.song, s2.song,
      levenshtein (s1.group, s2.group) as group_match,
      levenshtein (s1.song, s2.song) as song_match
    from
      songs s1
      cross join songs s2
    order by
      group_match, song_match
    

    Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble to the top the most likely duplicates / matches. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, begins with the same letter, etc, or pre-filtering out groups where the Levenschtein is greater than x.