Assume we are using a hash map structure with an int key type:
std::unordered_map<int, data_type> um;
Plus, when the total (or maximum) number of elements N is known, the hash table can be constructed in advance:
um.reserve(N); // internally, this calls rehash()
Here, an integer itself can be used as an identity (hash) function for a hash table, as far as I know.
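As a minimal check of that claim (assuming a mainstream standard library such as libstdc++ or libc++, where std::hash<int> happens to behave as the identity; other implementations may differ):
#include <functional>
#include <iostream>

int main() {
    // On common implementations, std::hash<int> is effectively the
    // identity function, so this typically prints 42:
    std::cout << std::hash<int>{}(42) << '\n';
}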
Meanwhile, a contiguous data set (i.e. a std::vector, or a simple array) can be random-accessed by displacement from the address of the front-most element.
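A quick sketch of that guarantee (contiguity of std::vector storage is required by the standard, so this assertion always holds):
#include <cassert>
#include <vector>

int main() {
    std::vector<int> v(100);
    // Contiguous storage: element i lives at a fixed displacement from
    // the front, so random access is a single address computation.
    assert(&v[42] == v.data() + 42);
}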
Both containers use int as an accessing key, like this:
um[1] = data_type(1); //std::unordered_map<int, data_type>
v[1] = data_type(1); //std::vector<data_type>
Then, is there any difference between the constructed hash table and the std::vector, in memory usage or in searching mechanism/performance, or in anything else?
Let's make the problem tangible.
Suppose I know that the 3 keys 0, 5, and 9987 are certainly used, but keys 1~9986 may or may not be used. If I know that no key in the set will be bigger than 10000, then using a std::vector of size 10000 will guarantee O(1) time complexity for accessing random data, but memory will be wasted.
In this situation, does std::unordered_map produce a better solution for the problem? (By "better" I mean a solution that saves as much memory as possible while maintaining the same level of time complexity.)
Plus, when the total (or maximum) number of elements N is known, the hash table can be constructed in advance:
um.reserve(N); // internally, this calls rehash()
Here, an integer itself can be used as an identity (hash) function for a hash table, as far as I know.
That's true, and reasonable in two very different scenarios: 1) when the values are pretty much contiguous with perhaps a few missing values, or 2) when the values are quite random. In many other situations, you may risk excessive hash table collisions if you don't provide a meaningful hash function.
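A sketch of that collision risk (bucket placement is implementation-specific; with libstdc++'s prime bucket counts and modulo placement, the keys below all land in the same bucket):
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> um;
    um.rehash(17); // libstdc++ picks a prime bucket count; 17 is already prime

    // With an identity hash and modulo placement, keys congruent modulo
    // the bucket count all collide in one bucket (here, bucket 0):
    for (int k : {0, 17, 34, 51}) {
        um[k] = k;
        std::cout << "key " << k << " -> bucket " << um.bucket(k) << '\n';
    }
}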
Then, is there any difference between the constructed hash table and the std::vector, in memory usage or in searching mechanism/performance, or in anything else?
Yes. After your .reserve(N), the hash table allocates a contiguous block of memory (basically, an array) for at least N "buckets". In the GCC implementation, N is rounded up to a prime. Each bucket may store an iterator into a forward-linked list of pair<int, data_type> nodes.
So, if you actually put N entries into the hash table, you have roughly...
N * sizeof(forward-list iterator) for the bucket array, plus
N * (sizeof(pair<int, data_type>) + sizeof(next-pointer/iterator for the forward-list node)) for the nodes...
...whilst the vector only uses about N * sizeof(data_type) bytes of memory: potentially a small fraction of the memory used by the hash table. And because all the vector's memory for data_types is contiguous, you're much more likely to benefit from the CPU caching elements adjacent to the one you're currently accessing, such that they're all much faster to access later.
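To make those numbers concrete, here's a back-of-envelope calculation (the 16-byte data_type is made up for illustration, and real bucket/node overheads vary per implementation):
#include <cstdio>
#include <unordered_map>
#include <vector>

struct data_type { double payload[2]; }; // hypothetical 16-byte element

int main() {
    const std::size_t N = 10000;

    // Chained hash table (libstdc++-style): one pointer-sized bucket entry
    // plus one heap node (key/value pair + next pointer) per element.
    std::size_t table_bytes =
        N * sizeof(void*) +
        N * (sizeof(std::pair<const int, data_type>) + sizeof(void*));

    // The vector stores the elements themselves, contiguously.
    std::size_t vector_bytes = N * sizeof(data_type);

    std::printf("hash table ~ %zu bytes, vector ~ %zu bytes\n",
                table_bytes, vector_bytes);
}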
On the other hand, if you haven't put many elements into the hash table, then the main thing using memory is the array of buckets containing iterators, which are usually the size of pointers (e.g. 32 or 64 bits each), whereas the vector of data_type - if you reserve(N) there too - will already have allocated N * sizeof(data_type) bytes of memory: for large data_types that may be massively more than the hash table. Still, you can often allocate virtual memory, and if you haven't faulted the pages in such that they need physical backing memory, there's no meaningful memory usage or performance penalty to your program or computer. (At least with 64-bit programs, virtual address space is effectively unlimited.)
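As an illustrative sketch of that point (the 1 GiB figure is arbitrary, and what actually gets faulted in depends on the allocator and OS):
#include <cstdio>
#include <vector>

int main() {
    // Reserve 1 GiB of element storage. On typical 64-bit OSes with demand
    // paging, this consumes virtual address space, but physical pages are
    // only backed once they are actually touched (faulted in).
    std::vector<char> v;
    v.reserve(std::size_t{1} << 30);
    std::printf("capacity: %zu bytes\n", v.capacity());
    // Until we write into those pages, resident (physical) memory stays small.
}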
Suppose I know that the 3 keys 0, 5, and 9987 are certainly used, but keys 1~9986 may or may not be used. If I know that no key in the set will be bigger than 10000, then using a std::vector of size 10000 will guarantee O(1) time complexity for accessing random data, but memory will be wasted.
In this situation, does std::unordered_map produce a better solution for the problem? (By "better" I mean a solution that saves as much memory as possible while maintaining the same level of time complexity.)
In this situation, if you reserve(10000) up front and data_type is not significantly bigger than an iterator/pointer, then the unordered_map would be unequivocally worse in every regard. If you don't reserve up front, the hash table will only allocate space for a handful of buckets, and you'll be using a lot less virtual address space than a vector with 10000 elements (even if data_type were bool).
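A small sketch of that trade-off (exact bucket counts are implementation-defined; int stands in for data_type here):
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Only keys 0, 5 and 9987 are certainly used.
    std::unordered_map<int, int> um; // no reserve: grows with actual usage
    for (int k : {0, 5, 9987}) um[k] = k;

    std::vector<int> v(10000); // pays for all 10000 slots up front
    v[0] = 0; v[5] = 5; v[9987] = 9987;

    std::cout << "map buckets: " << um.bucket_count() // small, e.g. 7 or 13
              << ", vector bytes: " << v.size() * sizeof(int) << '\n';
}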