python dictionary scripting data-management

2D "array" or object python for filtering duplicates


I'm trying to filter duplicate users from a database. Each user has a unique user_id and a full name. I'm comparing the names using difflib.get_close_matches.

Since the names aren't unique, I created a dictionary with the user_id as key and the name as value. But comparing names this way requires iterating over the full dictionary every time, and accessing the names is kind of a pain.

I was thinking about just using a 2D array (a list of lists), as it's quicker to get at the data, but I don't really want to work with indexes (IMHO it's a pretty ugly way to deal with the problem). Any suggestions on how to solve this in an elegant way are highly appreciated. I'm still learning Python, btw.

Edit: The dataset looks like this:


user_id  name

4050 John Doe
4059 John doe
4052 John Doe1 
9083 Napoleon Bonnaparte
7842 Mad Max
4085 Johnn Doe
4084 Alice Spring
5673 Fredy Krüger
4092 Alice Spring1
4042 Alice k Spring
4122 Max miller

In the end I need to find the user_ids for the names that are similar, which is why I am using difflib.get_close_matches. So the list should look like the following in the end:


user_id  name


4050 John Doe
4059 John doe
4052 John Doe1 
4085 Johnn Doe
4084 Alice Spring
4092 Alice Spring1
4042 Alice k Spring

Solution

  • It looks to me like you really want to go from name to id and not the other way around. The way to tackle the issue of full names not necessarily being unique is to have a list of user_ids against each full name. So, reverse your dictionary that has the user_id as key and the name as related object. Like this:

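    from collections import defaultdict

    # mydict is your existing dictionary: {user_id: full_name}
    lookup = defaultdict(list)  # full_name -> list of user_ids
    for user_id, name in mydict.items():
        lookup[name].append(user_id)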
    

    Now build a dict of close matches using difflib.get_close_matches(): key is full name, value is a list of potentially duplicate full names. It appears from your question that you already know how to do that.
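In case it helps, here is a minimal sketch of that step using the sample rows from the question. The `CUTOFF` value of 0.8 and the names `names` and `others` are my own assumptions, not something from the question; `n=len(names)` is there because get_close_matches returns at most 3 matches by default.

```python
import difflib
from collections import defaultdict

# Sample rows from the question: user_id -> full name
mydict = {
    4050: "John Doe",
    4059: "John doe",
    4052: "John Doe1",
    9083: "Napoleon Bonnaparte",
    7842: "Mad Max",
    4085: "Johnn Doe",
    4084: "Alice Spring",
    5673: "Fredy Krüger",
    4092: "Alice Spring1",
    4042: "Alice k Spring",
    4122: "Max miller",
}

# Reverse mapping: full name -> list of user_ids
lookup = defaultdict(list)
for user_id, name in mydict.items():
    lookup[name].append(user_id)

names = list(lookup)
CUTOFF = 0.8  # assumed similarity threshold; tune it for your data

close_matches = {}
for name in names:
    # n=len(names) lifts the default limit of 3 matches per query
    matches = difflib.get_close_matches(name, names, n=len(names), cutoff=CUTOFF)
    others = [m for m in matches if m != name]  # drop the name itself
    if others:
        close_matches[name] = others
```

Names without any near match at this cutoff (for example "Mad Max") simply do not appear as keys in `close_matches`.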

    Loop through your dict of close matches and report full name and id:

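    for name, duplicate_list in close_matches.items():
        # Report every id that carries this exact name
        for user_id in lookup[name]:
            print(user_id, name)
        # Then report the ids of each near match, once per name
        for duplicate in duplicate_list:
            if duplicate == name:
                continue  # get_close_matches can return the name itself
            for user_id in lookup[duplicate]:
                print(user_id, duplicate, "possible duplicate of", name)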
    

    I've put a print() call here for simplicity but you will almost certainly want to assemble the results into a list for further processing.
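A rough sketch of collecting the results into a list instead of printing them could look like the following. The function name `collect_duplicates` and the tiny `lookup` / `close_matches` inputs are hypothetical, made up only to keep the example self-contained.

```python
def collect_duplicates(close_matches, lookup):
    """Return (user_id, name, similar_to) tuples for every near match."""
    results = []
    for name, duplicate_list in close_matches.items():
        for duplicate in duplicate_list:
            if duplicate == name:
                continue  # skip the name matching itself
            for user_id in lookup[duplicate]:
                results.append((user_id, duplicate, name))
    return results


# Hypothetical inputs shaped like the structures described above
lookup = {
    "John Doe": [4050],
    "John doe": [4059],
    "Alice Spring": [4084],
}
close_matches = {
    "John Doe": ["John doe"],
    "John doe": ["John Doe"],
}

results = collect_duplicates(close_matches, lookup)

# Unique user_ids that were flagged at least once
flagged_ids = sorted({user_id for user_id, _, _ in results})
```

From `results` you can then build whatever report you need, e.g. the list of flagged user_ids shown in the question.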