Search code examples
pythonsortinghashsetsortedcontainers

Sorted set searching and ordering


Python Sorted Containers package contains SortedSet, that can be initialized as follows:

__init__(iterable=None, key=None)

key: Function used to extract comparison key from values. Sorted set compares values directly when the key function is none.

I want to store some strings that should be sorted by timestamp. If I store tuples like ("string", 1234), where 1234 is some monotonically increasing number like the difference between now and epoch, how can I search for the string in the set, and also keep it sorted using the 2nd element of the tuple? In other words, the timestamp shouldn't be used for searching, and the string shouldn't be used for ordering.


Solution

  • Keep the timestamps in an auxiliary dict, rather than in the SortedSet itself. For example, something like this:

    import sortedcontainers
    
    class StringTimes:
        def __init__(self):
            self.timestamps = {}
            self.strings = sortedcontainers.SortedSet(key=timestamps.__getitem__)
    
        def add(self, string, time):
            # If the element is already in the set, we need to get rid of it first, so we
            # don't corrupt the ordering when we update the timestamp.
            self.strings.discard(string)
    
            self.timestamps[string] = time
            self.strings.add(string)
    
        def remove(self, string):
            self.strings.remove(string)
            del self.timestamps[string]
    
        def __contains__(self, string):
            return string in self.strings
    
        def __iter__(self):
            return iter(self.strings)
    

    Since we're using a separate dict with string keys, we don't actually need the set part of SortedSet, so we could save some memory and just use a SortedList. (If new timestamps are guaranteed to be monotonically increasing, as your comment suggests, then we don't need sortedcontainers at all. We can just rely on dict ordering. I wouldn't count on that assumption holding, though - it's too likely to break once you introduce multithreading or need to sync updates across a network or something.)