Tags: python, algorithm, performance, list, time-complexity

Efficiently finding duplicates in a list


I have the function below, which searches an array for duplicate entries and then returns a list of the duplicates. I'd like to speed up this piece of my code. Can anyone suggest a more efficient way?

Code:

def findDupe(array):
    dupelist = []
    # Compare every pair of indices (i, j), in both orders.
    for i in range(len(array)):
        for j in range(len(array)):
            comp1 = array[i]
            comp2 = array[j]
            # A duplicate is an equal value at a different index;
            # record each duplicated value only once.
            if comp1 == comp2 and i != j:
                if comp2 not in dupelist:
                    dupelist.append(comp2)
    return dupelist
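
For example, with a small sample list (made up for illustration), it returns each duplicated value once:

    >>> findDupe([1, 2, 3, 2, 1, 4])  # illustrative input
    [1, 2]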

Solution

  • The idea here is to do a single sweep in linear time. You can use a Counter for this: count each element, then return all those which were counted more than once.

    from collections import Counter

    def get_duplicates(array):
        # One pass to count, one pass to collect values seen more than once.
        c = Counter(array)
        return [k for k, count in c.items() if count > 1]
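
    For instance, on the same sample list as above (mine, for illustration), this returns the same duplicates as the quadratic version:

    >>> get_duplicates([1, 2, 3, 2, 1, 4])  # illustrative input
    [1, 2]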
    

    The approach above is linear in complexity, but makes two passes over the input: one to count (this is abstracted away by the Counter constructor) and one to collect the non-unique values in the list comprehension. You can optimise this a bit with a yield statement, returning duplicates as you find them.

    def get_duplicates(array):
        c = Counter()
        seen = set()
        for i in array:
            c[i] += 1
            # Yield a value the moment its second occurrence appears;
            # `seen` guarantees each duplicate is yielded only once.
            if c[i] > 1 and i not in seen:
                seen.add(i)
                yield i
    

    This costs a membership check on every element and extra space in the form of a set, but it reduces the two passes to one.
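
    Since this version is a generator, wrap it in list() if you need all the duplicates at once. Note that it emits values in the order of their second occurrences (the sample input below is mine):

    >>> list(get_duplicates([1, 2, 3, 2, 1, 4]))  # illustrative input
    [2, 1]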