
Is there some built in facility in Python that provides a set with a maximum size and that deletes oldest entries after max size is reached?


I have a very specific need: a set-like implementation (for O(1) membership checking) that allows a maximum number of elements, and once that many have been added, adding new items causes the oldest (in the order they were added) to be replaced / deleted. In other words, FIFO / First In First Out.

Is there such a thing in Python, either built-in or in a library?

I searched but had no luck...

My use case is that I have a steady stream of hashes coming in, and I need to detect duplicates. But of course memory is limited, and in my case older hashes have less value, so once the set reaches its cap I would like older hashes to be replaced as newer ones come in, so that the size stays constant from that point on.

Any pointers will be very helpful, thank you!


Solution

  • As requested, my comment in the form of an answer.

    Using both is the canonical solution for this problem. Add new elements to one end of the deque and to the set. Test membership against the set. And if more than N elements are in, pop one element from the other end of the deque, and remove it from the set as well (the pop from the deque tells us which element it is).

    So, that would be something like

    from collections import deque

    class FixedSizeSet:
        def __init__(self, N):
            self.N = N
            self.theSet = set()
            self.q = deque()

        def add(self, elt):
            if elt in self.theSet:
                return
            self.theSet.add(elt)
            self.q.append(elt)
            while len(self.q) > self.N:
                old = self.q.popleft()
                self.theSet.remove(old)

        def remove(self, elt):
            # This is the weak part: an explicit remove is O(n),
            # whereas the implicit one (the eviction that occurs because
            # you added something) is O(1). But your usage didn't mention
            # any explicit remove.
            self.q.remove(elt)
            self.theSet.remove(elt)

        def __contains__(self, elt):
            return elt in self.theSet

        def __repr__(self):
            return self.theSet.__repr__()
    
    
    
    s=FixedSizeSet(3)
    s.add(1)
    s.add(2)
    s.add(3)
    s.add(4)
    1 in s
    # False
    2 in s
    # True
    4 in s 
    # True
    

    Cost of adding is indeed O(1), apparently: here is the cost of adding n×10000 elements (so from n=10000 to n=10000000) to a FixedSizeSet(10000).

    (plot omitted: total time vs. number of elements added, roughly linear)

    (The very fast ramp-up at the beginning, which is reproducible and not just noise, comes from the fact that testing membership in a set is not strictly O(1), even if Python's docs say otherwise.)

    But that is not surprising. Of course it is O(1), whatever the implementation, since the size is bounded: once we've added N elements, every add has the same cost, that of adding one more item to a collection of N items and then removing one item from a collection of N+1 items.

    If I multiply N (N being the max size of the set) by 10, I get:

    (plot omitted: same timing with the larger N, essentially unchanged)

    So it doesn't depend on N either, which is more interesting.
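    As a side note, since Python 3.7 plain dicts preserve insertion order, so the same FIFO-set idea can be sketched with a single dict instead of a set + deque. This also makes the explicit remove O(1), since deleting a dict key is O(1) while `deque.remove` is O(n). This is an alternative sketch, not the answer's implementation; the class name `FifoSet` is made up:

    ```python
    class FifoSet:
        """FIFO-bounded set backed by an insertion-ordered dict (Python 3.7+)."""

        def __init__(self, maxsize):
            self.maxsize = maxsize
            self._d = {}  # keys are the elements; values are unused

        def add(self, elt):
            if elt in self._d:
                return
            self._d[elt] = None
            if len(self._d) > self.maxsize:
                # next(iter(...)) yields the oldest key; deleting it is O(1)
                oldest = next(iter(self._d))
                del self._d[oldest]

        def remove(self, elt):
            del self._d[elt]  # O(1), unlike deque.remove

        def __contains__(self, elt):
            return elt in self._d

        def __len__(self):
            return len(self._d)


    s = FifoSet(3)
    for x in (1, 2, 3, 4):
        s.add(x)
    print(1 in s, 4 in s)  # False True
    ```

    The behaviour matches the usage example above: with a cap of 3, adding a fourth element evicts the oldest one.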