Search code examples
pythonpython-2.7setduplicatestest-and-set

How to process each item in an ordered list with duplicates only once in Python?


I have an ordered list of things to process that includes some duplicates and I only want to process the first occurrence. Presently, I'm doing it like this in Python v2.7:

seen = set()
for (value, fmt) in formats:
  if fmt not in seen:
    seen.add(fmt)
    process(value, fmt)

Is there anyway to simultaneously insert a new element into seen and detect whether or not it was already present? (This would avoid the repeated lookup of fmt in the set.)

seen = set()
for (value, fmt) in formats:
  # myInsert() would return true if item was not already present.
  if seen.myInsert(fmt):
    process(value, fmt)

Or, alternatively, can I somehow just filter my formats to exclude duplicate entries before looping?

unique_formats = removeDuplicates(formats, key=itemgetter(1))
for (value, fmt) in unique_formats:
  process(value, fmt)

Solution

  • You could take the length of the set before and after the add(). If it didn't change, the format was already in the set.

    seen = set()
    for (value, fmt) in formats:
        l1 = len(seen)
        seen.add(fmt)
        if l1 != len(seen):
             process(value, fmt)
    

    Your question presumes that the in test is a costly operation. This turns out not to be the case. Using len() can take more time, although both are quite fast;

    In [4]: seen = set(range(10000))
    
    In [5]: %timeit 5995 in seen
    10000000 loops, best of 3: 122 ns per loop
    
    In [6]: %timeit len(seen)
    1000000 loops, best of 3: 167 ns per loop
    

    (measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)