Search code examples
pythonlistobjectunique

Find the sub-set of unique objects in a list which contains duplicate references to the objects


I have an algorithm that creates a set of lists in the values of a dictionary. However, the number of lists is less than the number of dictionary keys because some of the values are references to the same list object.

When the algorithm is complete, I want to extract a list of only the remaining unique list objects. I want to avoid having to compare two lists since this is inefficient.

I came up with this way to do it using the id function. Maybe this is fine but I wasn't sure if this is an appropriate use of id and wondering if there's a simpler way to do it.

# Start with a full set of unique lists
groups = {i: [i] for i in range(1, 6)}

# Algorithm joins some of the lists together which 
# reduces the total number
for a, b in [(1, 2), (2, 4)]:
    groups[a] += groups[b]
    groups[b] = groups[a]

print(groups)

Output:

{1: [1, 2, 4], 2: [1, 2, 4], 3: [3], 4: [1, 2, 4], 5: [5]}

Note: Now some of the values of the dictionary contain the same list, [1, 2, 4].

# Find the remaining unique lists
all_values = list(groups.values())
ids = [id(x) for x in all_values]
result = [all_values[ids.index(a)] for a in set(ids)]

print(result)

Output:

[[1, 2, 4], [3], [5]]

This is a common question in other languages but I couldn't find one on how to do this in Python.


Solution

  • What you have is a dictionary where values (lists) may be duplicated.

    Sets are useful for reducing duplicate data. However, as lists are not hashable you need to convert them to something else - e.g., a tuple

    You could do this:

    groups = {i: [i] for i in range(1, 6)}
    
    # Algorithm joins some of the lists together which 
    # reduces the total number
    for a, b in [(1, 2), (2, 4)]:
        groups[a] += groups[b]
        groups[b] = groups[a]
    
    print(groups)
    
    result = list(map(list, {*(tuple(e) for e in groups.values())}))
    
    print(result)
    

    Output:

    [[3], [1, 2, 4], [5]]
    

    A potential disadvantage of this technique is that the output order may not be what's required

    Note:

    Just to emphasise the id v. hash dilemma, consider this:

    a=(1,2,3,4,5)           
    b=(1,2,3,4,5)
                       
    id(a)              
    4429940832
    
    id(b)           
    4429952432
    
    hash(a)           
    -5659871693760987716
    
    hash(b) 
    -5659871693760987716