Tags: python, python-3.x, set, generator, unique

Getting first n unique elements from Python list


I have a Python list in which elements can repeat.

>>> a = [1,2,2,3,3,4,5,6]

I want to get the first n unique elements from the list. So, in this case, if I want the first 5 unique elements, they would be:

[1,2,3,4,5]

I have come up with a solution using generators:

def iterate(itr, upper=5):
    count = 0
    for index, element in enumerate(itr):
        if index==0:
            count += 1
            yield element
        elif element not in itr[:index] and count<upper:
            count += 1
            yield element

In use:

>>> i = iterate(a, 5)
>>> [e for e in i]
[1, 2, 3, 4, 5]

I have doubts about this being the optimal solution. Is there an alternative strategy I could use to write it in a more Pythonic and efficient way?


Solution

  • I would use a set to remember what was seen and return from the generator when you have seen enough:

    a = [1, 2, 2, 3, 3, 4, 5, 6]
        
    def get_unique_N(iterable, N):
        """Yields (in order) the first N unique elements of iterable. 
        Might yield less if data too short."""
        seen = set()
        for e in iterable:
            if e in seen:
                continue
            seen.add(e)
            yield e
            if len(seen) == N:
                return
                
    k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
    print(list(k))
        
    

    Output:

    [1, 2, 3, 4]
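
    A variant along the same lines (a sketch, not part of the original answer): keep the dedup generator unbounded and let itertools.islice do the capping. This mirrors the unique_everseen recipe from the itertools documentation:

    from itertools import islice

    def unique_everseen(iterable):
        """Yield elements in first-seen order, skipping duplicates."""
        seen = set()
        for e in iterable:
            if e not in seen:
                seen.add(e)
                yield e

    print(list(islice(unique_everseen([1, 2, 2, 3, 3, 4, 5, 6]), 4)))
    # [1, 2, 3, 4]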
    

    According to PEP 479 you should return from generators rather than raise StopIteration - thanks to @khelwood & @iBug for pointing that out; you never stop learning.

    With Python 3.6 you get a DeprecationWarning; with 3.7 it raises a RuntimeError - see the PEP 479 Transition Plan if you are still using raise StopIteration.
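
    For illustration, a minimal sketch of the pattern PEP 479 forbids (a hypothetical get_unique_N_old, not part of the original answer) - the raise below surfaces as a RuntimeError on Python 3.7+:

    def get_unique_N_old(iterable, N):
        """Pre-PEP-479 style: stopping a generator by raising StopIteration.
        On Python 3.7+ this raise is turned into a RuntimeError for the caller."""
        seen = set()
        for e in iterable:
            if e in seen:
                continue
            seen.add(e)
            yield e
            if len(seen) == N:
                raise StopIteration  # RuntimeError on 3.7+; write `return` instead

    list(get_unique_N_old([1, 2, 2, 3], 3))  # RuntimeError under Python 3.7+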


    Your solution, using elif element not in itr[:index] and count<upper:, performs O(k) lookups, with k being the length of the slice; using a set reduces this to O(1) lookups but uses more memory, because the set has to be kept around as well. It is a speed vs. memory tradeoff - which is better depends on your application and data.

    Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:

    For 6 unique values (in the longer list):

    • yours would need O(1) + O(2) + ... + O(5001) lookups
    • mine would need 5001 * O(1) lookups, plus the memory for set({1, 2, 3, 4, 5, 6})
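
    To make the tradeoff concrete, a rough timing sketch with timeit (illustrative only - absolute numbers depend on your machine; both functions are redefined here so the snippet runs standalone):

    import timeit

    def iterate(itr, upper=5):  # the slice-based version from the question
        count = 0
        for index, element in enumerate(itr):
            if index == 0:
                count += 1
                yield element
            elif element not in itr[:index] and count < upper:
                count += 1
                yield element

    def get_unique_N(iterable, N):  # the set-based version from above
        seen = set()
        for e in iterable:
            if e in seen:
                continue
            seen.add(e)
            yield e
            if len(seen) == N:
                return

    long_list = [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]
    t_slice = timeit.timeit(lambda: list(iterate(long_list, 6)), number=10)
    t_set = timeit.timeit(lambda: list(get_unique_N(long_list, 6)), number=10)
    print(f"slice-based: {t_slice:.3f}s  set-based: {t_set:.3f}s")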