I was building a map that can store values of any type, like this crate.
To do so, I'm using a HashMap keyed by TypeId. A TypeId is essentially a hash computed at compile time, so to improve performance I was wondering if I could use a custom Hasher that does nothing, since my input is already hashed.
That made me think: what are the risks of a bad Hasher for any HashMap? Performance? Security?
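Concretely, the idea would be something like this pass-through Hasher (a sketch; the names are placeholders):

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// A pass-through hasher: it assumes the key is already a high-quality
/// hash (e.g. a `TypeId`) and just forwards the bytes it receives.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    // Fallback for arbitrary byte sequences: a crude fold, only here so
    // the trait is fully implemented.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }

    // `TypeId` hashes itself via a u64 write, which we capture as-is.
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

type PassThroughBuildHasher = BuildHasherDefault<PassThroughHasher>;

fn main() {
    let mut map: HashMap<TypeId, &str, PassThroughBuildHasher> = HashMap::default();
    map.insert(TypeId::of::<u32>(), "u32");
    map.insert(TypeId::of::<String>(), "String");
    assert_eq!(map.get(&TypeId::of::<u32>()), Some(&"u32"));
}
```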
To understand the risks, you first need to understand how a HashMap works.
There are two main kinds of HashMap: Closed Addressing and Open Addressing, and each kind has many implementation variants.
The Closed Addressing hash maps are simpler, so let's start with those. The underlying data structure within such a hash map is a dynamic array of buckets:
For a good hash function, hash % array-size is expected to yield a fairly uniform spread of indexes, and thus roughly all buckets will have the same size. That is, for N elements in an array of size A, each bucket is expected to hold roughly N / A elements. Given that A grows with N so as to keep the load factor N / A below some constant L, each bucket holds roughly N / A = L elements at most: a constant.
Trivially, this leads to an O(1) average complexity for Insertion & Search.
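The bucket-array idea can be sketched as a toy separate-chaining map (an illustration only, not production code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Closed Addressing (separate chaining) map: a fixed array of
/// buckets, each bucket a Vec of (key, value) pairs.
struct ChainedMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
}

impl<K: Hash + Eq, V> ChainedMap<K, V> {
    fn with_buckets(n: usize) -> Self {
        ChainedMap { buckets: (0..n).map(|_| Vec::new()).collect() }
    }

    fn index(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: K, value: V) {
        let i = self.index(&key);
        let bucket = &mut self.buckets[i];
        // Scan the bucket: O(bucket length), expected O(1) when the
        // hash spreads keys evenly.
        for pair in bucket.iter_mut() {
            if pair.0 == key {
                pair.1 = value;
                return;
            }
        }
        bucket.push((key, value));
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.buckets[self.index(key)]
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v)
    }
}

fn main() {
    let mut map = ChainedMap::with_buckets(8);
    map.insert("one", 1);
    map.insert("two", 2);
    assert_eq!(map.get(&"one"), Some(&1));
    assert_eq!(map.get(&"three"), None);
}
```

(A real implementation would also grow the bucket array when the load factor gets too high; that is elided here.)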
The Open Addressing hash maps are more complex. The underlying data-structure is just a dynamic array of cells:
Or at least, that's the gist, because deleting elements may poke holes in the sequence of cells to search, and there are various strategies to deal with that. Let's say for now that we use backward-shifting, i.e., moving all further elements in the sequence back one cell to fill the hole.
You can treat a sequence of cells as a bucket... but because "neighbouring" buckets merge into a single sequence, you get clustering issues which make complexity analysis more difficult.
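A linear-probing lookup over such a cell array can be sketched as follows (a toy with no deletion; linear probing is only one of the probing strategies):

```rust
/// Toy Open Addressing (linear probing) lookup: start at hash % len,
/// walk forward until the key or an empty cell is found.
fn probe_find(cells: &[Option<(u64, &'static str)>], hash: u64, key: u64) -> Option<&'static str> {
    let len = cells.len();
    for step in 0..len {
        let i = ((hash as usize) + step) % len;
        match cells[i] {
            None => return None,          // empty cell: the key cannot be further
            Some((k, v)) if k == key => return Some(v),
            Some(_) => continue,          // occupied by another key: probe on
        }
    }
    None
}

fn main() {
    // A tiny fixed table; positions chosen by hand for illustration.
    let mut cells: [Option<(u64, &'static str)>; 4] = [None; 4];
    cells[1] = Some((5, "five")); // 5 % 4 == 1
    cells[2] = Some((9, "nine")); // 9 % 4 == 1 too: shifted to the next cell
    assert_eq!(probe_find(&cells, 9, 9), Some("nine"));
    assert_eq!(probe_find(&cells, 13, 13), None); // probing stops at empty cell 3
}
```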
Well, let's take an absurd case: a function which always returns 4¹.
Sticking to our Closed Addressing Hash Map, the number of elements in the buckets is then highly skewed:
This means that suddenly Insertion and Search are O(N), not O(1). Oh.
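To see the skew concretely, here is a quick simulation of the bucket distribution under that constant hash function (an illustrative sketch):

```rust
// With a constant "hash function" (always 4), every key lands in the
// same bucket, so a chained hash map degenerates into one long list
// and every insertion or search scans that list: O(N).
fn constant_hash(_key: u64) -> u64 {
    4
}

fn main() {
    const BUCKETS: usize = 8;
    let mut sizes = [0usize; BUCKETS];
    for key in 0..1_000u64 {
        let i = (constant_hash(key) as usize) % BUCKETS;
        sizes[i] += 1;
    }
    // Bucket 4 got everything; the other buckets are empty.
    assert_eq!(sizes[4], 1_000);
    assert_eq!(sizes.iter().sum::<usize>(), 1_000);
}
```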
¹ Obtained by the roll of a fair die.
In 2003, a Denial of Service (DoS) attack named Hash Flooding was first described, and in the following years it was successfully weaponized against websites.
The strategy is simple: knowing the hash function used by the website -- typically, the standard hash function provided by the language powering the website -- adversaries would pre-compute a large number of keys which all hash to the same value (or at least to the same hash % array-size for any reasonably sized array), then send a GET or POST request with all those keys as parameter names. The web framework would attempt to put all the parameters in a hash map before calling the application, and since all parameter names would have the same hash, ...
... insertion of the parameters would overall be O(N²) instead of O(N).
For small values of N, it's not too bad. For N = 1'000, though, it's the difference between 1'000 operations and 1'000'000 operations, and it starts showing! With only a few "clients" sending such requests in a loop, an entire web server could be brought to its knees.
As a response, both Python & Ruby were updated so that their default hash function would be much harder to build pre-image collisions for, thereby making Hash Flooding impractical.
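Rust's standard library took the same route: the default HashMap hasher is SipHash-based and seeded with a random per-instance RandomState, which you can observe directly (a small sketch):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn hash_with(state: &RandomState, key: &str) -> u64 {
    let mut h = state.build_hasher();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    // Each RandomState draws a fresh random seed, so the same key hashes
    // differently from one map instance to the next: an attacker cannot
    // pre-compute colliding keys without knowing the seed. (The second
    // assertion holds with overwhelming probability, not by guarantee.)
    let (a, b) = (RandomState::new(), RandomState::new());
    assert_eq!(hash_with(&a, "key"), hash_with(&a, "key")); // deterministic per state
    assert_ne!(hash_with(&a, "key"), hash_with(&b, "key")); // seeded differently
}
```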
Note: Open Addressing adds clustering issues, since even nearby buckets can merge into a single probe sequence, so that even imperfect collisions can still lead to a perfect storm.
The algorithmic complexity of inserting and searching in a hash map is between O(1) in the best case and O(N) in the worst case. Using a bad hash function may lead to the average case being much closer to the worst case. At this point, your hash map is no better than an unsorted array.
What bad means depends on the scenario: absent an adversary, a bad hash function is simply one that spreads your actual keys unevenly across buckets; against an adversary, it is one for which collisions can be pre-computed, as with Hash Flooding above.
As an example, we can consider FNV-1a:
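The 64-bit variant is short enough to write out in full (the constants are the standard FNV offset basis and prime):

```rust
/// FNV-1a, 64-bit: XOR each byte into the state, then multiply by the
/// FNV prime. Fast and well spread for short keys, but not keyed, so an
/// adversary can construct colliding inputs offline.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // The empty input returns the offset basis, by definition.
    assert_eq!(fnv1a_64(b""), 0xcbf2_9ce4_8422_2325);
    assert_ne!(fnv1a_64(b"foo"), fnv1a_64(b"bar"));
}
```

It is a fine choice in the absence of an adversary, but since it is unkeyed and trivially invertible step by step, it offers no protection against Hash Flooding.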
And that's it.
TL;DR: you risk performance degradation, which may potentially be exploited for Denial of Service against your application.