
What are the risks of bad HashMap hashers?


I was building a map that can store any generic type, like this crate does.

By doing so, I'm using a HashMap that takes a TypeId as a key. This TypeId is a hash built at compile time, so to improve performance I was wondering if I could use a custom Hasher that does nothing, as my input is already hashed.

That made me think: what would be the risks of a bad Hasher for any HashMap? Performance? Security?
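For concreteness, here is a minimal sketch of the kind of pass-through Hasher I have in mind (the IdentityHasher name is made up, and since TypeId's Hash implementation is an internal detail, the sketch folds the raw bytes instead of assuming a single write_u64 call):

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Illustrative pass-through hasher: assumes the key is already a
/// high-quality hash, so it barely mixes the bytes it receives.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    // TypeId's Hash implementation is an internal detail, so fold the
    // raw bytes in rather than assuming a single write_u64 call.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }
}

fn main() {
    let mut map: HashMap<TypeId, &'static str, BuildHasherDefault<IdentityHasher>> =
        HashMap::default();
    map.insert(TypeId::of::<u32>(), "u32");
    assert_eq!(map.get(&TypeId::of::<u32>()), Some(&"u32"));
}
```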


Solution

    HashMap Primer

    To understand the risks, you first need to understand how a HashMap works.

    There are two main kinds of HashMap: Closed Addressing and Open Addressing. And each kind has many implementation variants.

    The Closed Addressing hash maps are simpler, so let's start with those. The underlying data-structure within such a hash map is a dynamic array of buckets:

    • Insertion: compute the hash, derive an index as hash % array-size, iterate over all the elements in the bucket to check for a duplicate and, if there is none, insert.
    • Search: compute the hash, derive an index as hash % array-size, iterate over the elements in the bucket until you find a match (or exhaust the bucket).

    For a good hash function, it's expected that hash % array-size will yield a fairly uniformly spread set of indexes, and thus that all buckets will have roughly the same size. That is, for N elements in an array of size A, each bucket is expected to hold roughly N / A elements. Since A grows with N so as to keep the load factor L = N / A below a constant bound, each bucket holds roughly L elements: a constant.

    Trivially, this leads to an O(1) average complexity for Insertion & Search.
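    To make the mechanics concrete, here is a toy separate-chaining map along these lines (ChainedMap is a made-up name; resizing is omitted to keep the sketch short):

    ```rust
    use std::collections::hash_map::RandomState;
    use std::hash::{BuildHasher, Hash};

    /// Toy closed-addressing (separate chaining) map: a fixed array of
    /// buckets, each bucket a Vec of (key, value) pairs. No resizing.
    struct ChainedMap<K, V> {
        buckets: Vec<Vec<(K, V)>>,
        hasher: RandomState,
    }

    impl<K: Hash + Eq, V> ChainedMap<K, V> {
        fn with_buckets(n: usize) -> Self {
            Self {
                buckets: (0..n).map(|_| Vec::new()).collect(),
                hasher: RandomState::new(),
            }
        }

        /// The index derivation described above: hash % array-size.
        fn index(&self, key: &K) -> usize {
            (self.hasher.hash_one(key) % self.buckets.len() as u64) as usize
        }

        fn insert(&mut self, key: K, value: V) {
            let i = self.index(&key);
            // Scan the bucket for a duplicate first...
            for slot in &mut self.buckets[i] {
                if slot.0 == key {
                    slot.1 = value;
                    return;
                }
            }
            // ...and only push if none was found.
            self.buckets[i].push((key, value));
        }

        fn get(&self, key: &K) -> Option<&V> {
            self.buckets[self.index(key)]
                .iter()
                .find(|(k, _)| k == key)
                .map(|(_, v)| v)
        }
    }

    fn main() {
        let mut map = ChainedMap::with_buckets(16);
        map.insert("answer", 42);
        assert_eq!(map.get(&"answer"), Some(&42));
    }
    ```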

    The Open Addressing hash maps are more complex. The underlying data-structure is just a dynamic array of cells:

    • Insertion: compute the hash, derive an index as hash % array-size, then iterate over the cells from that index: stop if you find a cell with a matching key, and insert if you stumble upon an empty cell.
    • Search: compute the hash, derive an index as hash % array-size, then iterate over the cells from that index until you find a matching cell, aborting as soon as you stumble upon an empty cell.

    Or at least, that's the gist, because deleting elements may poke holes in the sequence of cells to search, and there are various strategies to deal with that. Let's say for now that we use backward-shifting, i.e. moving every later element in the sequence back one cell to fill the hole.

    You can treat a sequence of cells as a bucket... but due to "neighbour" buckets merging together into a single sequence, you get clustering issues which make complexity analysis more difficult. The linear-probing sketch below shows the core loops.
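    Here is the same toy exercise with open addressing and linear probing (ProbedMap is a made-up name; deletion, e.g. backward-shifting, and resizing are omitted):

    ```rust
    use std::collections::hash_map::RandomState;
    use std::hash::{BuildHasher, Hash};

    /// Toy open-addressing map with linear probing: one flat array of
    /// cells. Deletion (e.g. backward-shifting) and resizing omitted.
    struct ProbedMap<K, V> {
        cells: Vec<Option<(K, V)>>,
        hasher: RandomState,
    }

    impl<K: Hash + Eq, V> ProbedMap<K, V> {
        fn with_cells(n: usize) -> Self {
            Self {
                cells: (0..n).map(|_| None).collect(),
                hasher: RandomState::new(),
            }
        }

        fn insert(&mut self, key: K, value: V) {
            let len = self.cells.len();
            let start = (self.hasher.hash_one(&key) % len as u64) as usize;
            // Probe forward from the home cell, wrapping around.
            for step in 0..len {
                let i = (start + step) % len;
                match &mut self.cells[i] {
                    // Matching key: overwrite in place.
                    Some((k, v)) if *k == key => { *v = value; return; }
                    // Empty cell: claim it.
                    None => { self.cells[i] = Some((key, value)); return; }
                    // Occupied by another key: keep probing.
                    Some(_) => {}
                }
            }
            panic!("table full; a real map would have resized long before this");
        }

        fn get(&self, key: &K) -> Option<&V> {
            let len = self.cells.len();
            let start = (self.hasher.hash_one(key) % len as u64) as usize;
            for step in 0..len {
                let i = (start + step) % len;
                match &self.cells[i] {
                    Some((k, v)) if k == key => return Some(v), // matching cell
                    None => return None,        // empty cell: abort the search
                    Some(_) => {}               // occupied: keep probing
                }
            }
            None
        }
    }

    fn main() {
        let mut map = ProbedMap::with_cells(16);
        map.insert("answer", 42);
        assert_eq!(map.get(&"answer"), Some(&42));
    }
    ```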

    Effects of a poor hash function

    Well, let's take an absurd case: a function which always returns 4¹.

    Sticking to our Closed Addressing Hash Map, the number of elements in the buckets is then highly skewed:

    • All buckets (except the one at index 4) have 0 elements.
    • The bucket at index 4 has N elements.

    This means that suddenly Insertion and Search are O(N), not O(1). Oh.

    ¹ Obtained by the roll of a fair dice.
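    You can observe this degradation directly with Rust's standard HashMap by plugging in a hasher that ignores its input entirely (AlwaysFour is a made-up name, and exact timings will of course vary by machine):

    ```rust
    use std::collections::HashMap;
    use std::hash::{BuildHasherDefault, Hasher};
    use std::time::Instant;

    /// Degenerate hasher: every key hashes to 4, so every entry lands in
    /// the same place and each insertion re-scans all previous entries.
    #[derive(Default)]
    struct AlwaysFour;

    impl Hasher for AlwaysFour {
        fn finish(&self) -> u64 { 4 }
        fn write(&mut self, _bytes: &[u8]) {} // ignore the input entirely
    }

    fn main() {
        let mut map: HashMap<u64, u64, BuildHasherDefault<AlwaysFour>> =
            HashMap::default();
        let start = Instant::now();
        for i in 0..10_000 {
            map.insert(i, i); // O(N) per insert, O(N²) overall
        }
        println!("10'000 inserts, constant hash: {:?}", start.elapsed());
    }
    ```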

    Hash Flooding

    In 2003, a Denial of Service (DoS) attack named Hash Flooding was first described, and in the following years it was successfully weaponized against websites.

    The strategy is simple: knowing the hash function used by the website -- typically, the standard hash function provided by the language powering it -- adversaries would pre-compute a large number of keys which all result in the same hash (or the same hash % array-size, for any reasonably sized array), then send a GET or POST request with all those keys as parameter names. The web framework would attempt to put all the parameters in a hash map before calling the application, and since all parameter names would have the same hash, ...

    ... insertion of parameters would overall be O(N²) instead of O(N).

    For small values of N, it's not too bad. For N = 1'000, it's the difference between 1'000 operations and 1'000'000 operations, and it starts showing! With only a few "clients" sending such requests in a loop, an entire web server would be brought to its knees.

    As a response, both Python & Ruby were updated so that their default hash function would be much harder to build pre-image collisions for, thereby making Hash Flooding impractical.
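    Rust's standard library takes the same approach: the default RandomState seeds a keyed hash (currently SipHash-1-3) with per-instance random state, so an adversary cannot precompute colliding keys offline. A quick illustration:

    ```rust
    use std::collections::hash_map::RandomState;
    use std::hash::BuildHasher;

    fn main() {
        // Two independently created states hash the same input to
        // different values, and both change from run to run.
        let s1 = RandomState::new();
        let s2 = RandomState::new();
        println!("{:016x}", s1.hash_one("parameter_name"));
        println!("{:016x}", s2.hash_one("parameter_name"));
    }
    ```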

    Note: Open Addressing adds clustering issues on top, since even nearby buckets can merge together into a single sequence, so that even imperfect collisions can still lead to a perfect storm.

    Risks

    The algorithmic complexity of inserting and searching in a hash map is between O(1) in the best case and O(N) in the worst case. Using a bad hash function may lead to the average case being much closer to the worst case. At this point, your hash map is no better than an unsorted array.

    What bad means depends on the scenario:

    • If your input is under your control, or random, a bad hash function is simply a function which has many collisions. Such as a parity bit function, always returning 0 or 1.
    • If your input is potentially controlled by an adversary, a bad hash function is a function for which pre-image collisions can relatively easily be found.

    As an example, we can consider FNV-1a:

    • If your input is controlled, or random, it's a fair hash function. Not necessarily the best, but it's pretty fast, and the distribution of hashes is pretty good.
    • If your input is potentially controlled by an adversary, it's a terrible hash function. It's fairly easy to craft colliding inputs using maths, so your adversary can in no time whip up 1'000 inputs which will all collide.
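    For reference, here is a sketch of 64-bit FNV-1a. One xor and one multiply per byte, with public constants and no secret key: that simplicity is exactly what makes it fast, and also what makes collisions easy to construct offline.

    ```rust
    /// 64-bit FNV-1a: fast and decently distributed, but entirely
    /// predictable, so an adversary can search for collisions offline.
    fn fnv1a_64(bytes: &[u8]) -> u64 {
        let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
        for &byte in bytes {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
        hash
    }

    fn main() {
        println!("{:016x}", fnv1a_64(b"hello"));
    }
    ```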

    And that's it.

    TL;DR: you risk performance degradation, which may potentially be exploited for a Denial of Service attack on your application.