Search code examples
c++hashhashtableunordered

ordering of elements in hashtable of unordered set


I am confused about the hashtable in unordered containers... or at least in sets.

So, I thought it would work like this:

I hash my object. I calculate the modulo of object and vector lengths (hash%vectorlength) and use that as the index for the pointer to my object in the hash table.. which is a vector as far as I know...

so for a naive hash function that returns for a int wrapper just the int member's value it would look like this:

hash table:

vector index:           [0, 1, 2, 3, 4]
                         |  |  |  |  |
object with value...:    0  1  2  3  4

I wrote a program to test that:

#include <iostream>
#include <unordered_set>

struct Obj
{

public:

    Obj(int i)
    {
        mem = i;
    }

    friend bool operator==(const Obj& o1, const Obj& o2)
    {
        return (o1.mem == o2.mem);
    }

    friend std::ostream& operator<<(std::ostream& o, const Obj& obj)
    {
        o << obj.mem;
        return o;
    }

    int mem;

};

namespace std
{
    template<>
    struct hash<Obj>
    {
        size_t operator()(const Obj& r) const
        {
            size_t hash = r.mem;
            return hash;
        }
    };
}


int main()
{
    std::unordered_set<Obj> map;
    for (int i = 0; i < 5; ++i)
    {
        map.insert(Obj(i));
    }

    for(auto it = map.begin(); it != map.end(); ++it)
    {
        std::cout << (*it) << std::endl;
    }
}

I expected the output

0
1
2
3
4

but I got:

4
3
2
1
0

Why is that?


Solution

  • While it's true the Standard Library implementations can do whatever they like, it's also interesting to see where your assumptions - and those expressed in several comments - correspond with an actual implementation.

    I could reproduce your non "0 1 2 3 4" result with GCC, though only by adding a map.reserve(6) or more (weirdly, 5 produced "4 0 1 2 3").

    The details below simply explain the behaviour of the GCC version I looked at....

    Digging for an explanation, I first sanity-checked whether the logical buckets contained the hash-function-implied content:

    for (size_t i = 0; i < map.bucket_count(); ++i)
    {
        std::cout << '[' << i << ']';
        for (auto it = map.begin(i); it != map.end(i); ++it)
            std::cout << ' ' << *it;
        std::cout << '\n';
    }
    

    And, indeed they did:

    [0] 0
    [1] 1
    [2] 2
    [3] 3
    [4] 4
    [5]
    [6]
    

    So, the comment suggesting that "the Standard Library is free to apply any invertible function on top of your hash function, and no guarantee whatsoever about ordering is given" - while true - is not what's happening here.

    Digging into the Standard Library headers, I found the cause in bits/hashtable.h's documentation:

    * Here's _Hashtable data structure, each _Hashtable has:
    * - _Bucket[]       _M_buckets
    * - _Hash_node_base _M_before_begin
    * - size_type       _M_bucket_count
    * - size_type       _M_element_count
    *
    * with _Bucket being _Hash_node* and _Hash_node constaining:
    * - _Hash_node*   _M_next
    * - Tp            _M_value
    * - size_t        _M_code if cache_hash_code is true
    *
    * In terms of Standard containers the hastable is like the aggregation  of:
    * - std::forward_list<_Node> containing the elements
    * - std::vector<std::forward_list<_Node>::iterator> representing the  buckets
    *
    * The non-empty buckets contain the node before the first bucket node. This
    * design allow to implement something like a std::forward_list::insert_after
    * on container insertion and std::forward_list::erase_after on container
    * erase calls. _M_before_begin is equivalent to
    * std::foward_list::before_begin. Empty buckets are containing nullptr.
    * Note that one of the non-empty bucket contains &_M_before_begin which  is
    * not a derefenrenceable node so the node pointers in buckets shall never be
    * derefenrenced, only its next node can be.
    *
    * Walk through a bucket nodes require a check on the hash code to see if the
    * node is still in the bucket. Such a design impose a quite efficient hash
    * functor and is one of the reasons it is highly advise to set
    * __cache_hash_code to true.
    *
    * The container iterators are simply built from nodes. This way  incrementing
    * the iterator is perfectly efficient independent of how many empty buckets
    * there are in the container.
    *
    * On insert we compute element hash code and thanks to it find the bucket
    * index. If the element must be inserted on an empty bucket we add it at the
    * beginning of the singly linked list and make the bucket point to
    * _M_before_begin. The bucket that used to point to _M_before_begin, if any,
    * is updated to point to its new before begin node.
    

    So, the hash table underpinning unordered_set is organised with values in a singly-linked list, and the buckets a vector of iterators into that list, rather than the commonly envisaged vector<forward_list<>>.

    As you insert elements, they're going into the forward list at the front, and it's that list that you iterate when going from begin() to end(), without any involvement of the vector of iterators whose ordering corresponds to the hash values.

    Code here illustrates how iteration returns the values in inverse order of insertion, irrespective of hashes / collisions - as long as enough space is reserve()d up-front to avoid rehashing.