
Are hash sets worse in space complexity if there are a lot of inputs and removals?


Let’s say I have a hash set that I decide will only ever hold 3 elements at a time. But I insert and remove, for example, N gibberish words at random, while never having more than 3 elements in the set at any given moment.

Would the space complexity of this set still be O(1), or does the fact that I’m potentially inserting and then removing a million (or however many) inputs cause the underlying table to grow so that its size is proportional to N in some way?
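
(For concreteness, the workload I have in mind looks roughly like this toy Python sketch; the word generator and the value of N are just placeholders.)

```python
import random
import string

def random_word(length=8):
    # Placeholder generator for "gibberish words".
    return "".join(random.choices(string.ascii_lowercase, k=length))

N = 1_000_000        # total number of insert/remove operations
live = set()         # never holds more than 3 elements at any moment

for _ in range(N):
    if len(live) == 3 or (live and random.random() < 0.5):
        live.pop()                  # remove an arbitrary current element
    else:
        live.add(random_word())     # insert a fresh gibberish word
```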


Solution

  • This will depend on the specific hash table implementation that you’re using, but a “good” implementation of a hash table should have no trouble doing this in O(1) space.

    More specifically:

    1. If you’re using a chained hash table, the table will typically only resize if the load factor (the ratio of elements to table slots) exceeds some threshold. If the total number of elements is always at most three, that condition never triggers (see the chained-table sketch after this list).

    2. If you’re using a cuckoo hash table, then as in case (1) the tables only resize once the load factor gets high, so with only three live elements the tables should never resize once they’re big enough to hold three items at their target load factor.

    3. For a linear probing table or another open addressing table (quadratic probing, double hashing, etc.), deleted elements are marked with tombstones, which can cause the table to fill up with dead slots as elements are removed. However, a reasonable implementation can easily address this by rebuilding the table at its current size once the number of tombstones grows too large (see the probing sketch after this list). As a result, the size should not depend on the total number of elements that have been inserted in the past.

    4. For a Robin Hood hash table with backward-shift deletion, tombstones aren’t needed at all: a removal shifts the following entries back one slot, so nothing accumulates and the table size shouldn’t increase (see the last sketch after this list).
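
    To make points (1) and (2) concrete, here is a minimal sketch of a chained hash set (a toy of my own; the names and the 0.75 load-factor threshold are arbitrary choices, not taken from any particular library). The only event that grows the table is the count of live elements crossing the load-factor threshold, so churning through millions of insertions and removals while holding at most three elements never triggers a resize:

    ```python
    class ChainedHashSet:
        """Toy separate-chaining hash set; resizes only on the live load factor."""

        def __init__(self, num_buckets=8, max_load_factor=0.75):
            self._buckets = [[] for _ in range(num_buckets)]
            self._count = 0
            self._max_load_factor = max_load_factor

        def _bucket(self, key):
            return self._buckets[hash(key) % len(self._buckets)]

        def __contains__(self, key):
            return key in self._bucket(key)

        def add(self, key):
            bucket = self._bucket(key)
            if key not in bucket:
                bucket.append(key)
                self._count += 1
                # Grow only when the number of live elements gets too dense.
                if self._count / len(self._buckets) > self._max_load_factor:
                    self._resize(2 * len(self._buckets))

        def discard(self, key):
            bucket = self._bucket(key)
            if key in bucket:
                bucket.remove(key)
                self._count -= 1   # removal frees its space immediately; nothing is left behind

        def _resize(self, new_num_buckets):
            old_keys = [k for bucket in self._buckets for k in bucket]
            self._buckets = [[] for _ in range(new_num_buckets)]
            for k in old_keys:
                self._buckets[hash(k) % new_num_buckets].append(k)
    ```

    With at most three live elements and eight buckets, the load factor never exceeds 3/8, so `_resize` is never called no matter how many words have passed through the set.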
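
    Point (3) is the interesting case, because tombstones are exactly the mechanism that could otherwise tie space usage to the deletion history. Here is a rough linear-probing sketch (again a simplified toy of my own, with an arbitrary rule for when to rebuild):

    ```python
    _EMPTY = object()       # slot that has never been used
    _TOMBSTONE = object()   # slot whose element was deleted

    class LinearProbingSet:
        """Toy linear-probing hash set that rebuilds in place to purge tombstones."""

        def __init__(self, capacity=8):
            self._slots = [_EMPTY] * capacity
            self._count = 0        # live elements
            self._tombstones = 0   # dead slots left behind by discard()

        def __contains__(self, key):
            i = hash(key) % len(self._slots)
            while self._slots[i] is not _EMPTY:
                if self._slots[i] is not _TOMBSTONE and self._slots[i] == key:
                    return True
                i = (i + 1) % len(self._slots)
            return False

        def add(self, key):
            if key in self:
                return
            # Occupied slots = live elements + tombstones.  If the table looks
            # full, either grow (the live elements really are dense) or rebuild
            # at the same capacity (it is mostly tombstones), so the size tracks
            # the live count rather than the deletion history.
            if self._count + self._tombstones + 1 > len(self._slots) // 2:
                if self._count + 1 > len(self._slots) // 2:
                    self._rebuild(2 * len(self._slots))
                else:
                    self._rebuild(len(self._slots))
            i = hash(key) % len(self._slots)
            while self._slots[i] is not _EMPTY and self._slots[i] is not _TOMBSTONE:
                i = (i + 1) % len(self._slots)
            if self._slots[i] is _TOMBSTONE:
                self._tombstones -= 1
            self._slots[i] = key
            self._count += 1

        def discard(self, key):
            i = hash(key) % len(self._slots)
            while self._slots[i] is not _EMPTY:
                if self._slots[i] is not _TOMBSTONE and self._slots[i] == key:
                    self._slots[i] = _TOMBSTONE   # marker so later probes continue past it
                    self._count -= 1
                    self._tombstones += 1
                    return
                i = (i + 1) % len(self._slots)

        def _rebuild(self, capacity):
            live = [k for k in self._slots if k is not _EMPTY and k is not _TOMBSTONE]
            self._slots = [_EMPTY] * capacity
            self._count = 0
            self._tombstones = 0
            for k in live:
                self.add(k)
    ```

    With at most three live elements, the “grow” branch never fires, so the table is occasionally rebuilt in place to purge tombstones but never gets any bigger.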
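
    And for point (4), here is a rough sketch of Robin Hood insertion with backward-shift deletion (again my own simplified version). No tombstones are ever written, so removals leave nothing behind that could accumulate:

    ```python
    class RobinHoodSet:
        """Toy Robin Hood linear-probing set with backward-shift deletion."""

        _EMPTY = object()

        def __init__(self, capacity=8):
            self._slots = [self._EMPTY] * capacity
            self._count = 0

        def _distance(self, key, slot):
            """How far `slot` is from key's home slot (its probe distance)."""
            return (slot - hash(key) % len(self._slots)) % len(self._slots)

        def __contains__(self, key):
            i, dist = hash(key) % len(self._slots), 0
            while self._slots[i] is not self._EMPTY:
                if self._slots[i] == key:
                    return True
                # Robin Hood invariant: once we've probed further than the
                # resident element did, the key cannot be in the table.
                if self._distance(self._slots[i], i) < dist:
                    return False
                i, dist = (i + 1) % len(self._slots), dist + 1
            return False

        def add(self, key):
            if key in self:
                return
            if self._count + 1 > len(self._slots) // 2:
                self._resize(2 * len(self._slots))
            self._insert(key)
            self._count += 1

        def _insert(self, key):
            i, dist = hash(key) % len(self._slots), 0
            while self._slots[i] is not self._EMPTY:
                # "Rob the rich": swap with any resident that is closer to home.
                resident_dist = self._distance(self._slots[i], i)
                if resident_dist < dist:
                    self._slots[i], key = key, self._slots[i]
                    dist = resident_dist
                i, dist = (i + 1) % len(self._slots), dist + 1
            self._slots[i] = key

        def discard(self, key):
            i, dist = hash(key) % len(self._slots), 0
            while self._slots[i] is not self._EMPTY:
                if self._slots[i] == key:
                    # Backward shift: pull the following run back one slot until
                    # we hit an empty slot or an element already at its home slot.
                    nxt = (i + 1) % len(self._slots)
                    while (self._slots[nxt] is not self._EMPTY
                           and self._distance(self._slots[nxt], nxt) > 0):
                        self._slots[i] = self._slots[nxt]
                        i, nxt = nxt, (nxt + 1) % len(self._slots)
                    self._slots[i] = self._EMPTY   # no tombstone left behind
                    self._count -= 1
                    return
                if self._distance(self._slots[i], i) < dist:
                    return   # not present
                i, dist = (i + 1) % len(self._slots), dist + 1

        def _resize(self, capacity):
            live = [k for k in self._slots if k is not self._EMPTY]
            self._slots = [self._EMPTY] * capacity
            for k in live:
                self._insert(k)
    ```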

    This doesn’t mean, however, that every hash table implementation actually does this. Rather, it means that no “standard” hash table should need to grow based on the number of elements that have ever been added rather than the maximum number of elements stored at any one point.

    That being said, I do wonder whether hashing is even the right idea here. If you know you’ll only ever be dealing with three items at a time and are concerned about performance, it might be faster to just keep three variables holding your three items and check each one directly (see the sketch below).
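
    For what it’s worth, that last idea could look something like this hypothetical sketch (deliberately not a full set interface):

    ```python
    class TinyThreeSet:
        """Fixed-capacity "set" of at most three items: just check each slot."""

        _EMPTY = object()

        def __init__(self):
            self._a = self._b = self._c = self._EMPTY

        def __contains__(self, item):
            return item in (self._a, self._b, self._c)

        def add(self, item):
            if item in self:
                return
            if self._a is self._EMPTY:
                self._a = item
            elif self._b is self._EMPTY:
                self._b = item
            elif self._c is self._EMPTY:
                self._c = item
            else:
                raise ValueError("already holding three items")

        def discard(self, item):
            if self._a == item:
                self._a = self._EMPTY
            elif self._b == item:
                self._b = self._EMPTY
            elif self._c == item:
                self._c = self._EMPTY
    ```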