Using pre-computed hashes for Elasticsearch cardinality aggregations (without murmur3)

According to the Elasticsearch documentation, you can improve the performance of cardinality aggregations by pre-computing hashes before indexing a document. The documentation recommends using the mapper-murmur3 plugin, but it also says you can do this on the client side without the plugin. I have a high-cardinality string/keyword field that I'm running cardinality aggregations against, and I'd like to explore ways to pre-compute the hashes without the murmur3 plugin.

My questions are as follows:

1. If we pre-compute hashes prior to indexing, does the data type of the hash value in the indexed document matter? Does the value need to be hashed to a long or another numeric type, or is a string-based keyword field OK?
2. If the hashed value is stored in a string-based keyword field, how does Elasticsearch know that it doesn't need to compute the hash on that value? Wouldn't that value look like any other string if it's indexed as a string/keyword field?
3. Lastly, does pre-computing hashes have a meaningful impact on the memory use of the cardinality aggregation, or mainly on its speed? My instinct is that it wouldn't use much less memory with pre-computed hashes, since it's just removing the step of computing the hash on each unique value, but I thought I'd ask to gain a better understanding of what it's doing under the covers.

Thanks!

Solution

In pre-2.x versions of Elasticsearch (up to 1.7), the cardinality aggregation provided a rehash: true/false flag that let you specify whether the values of the aggregated field needed to be hashed at search time. It was meant for the case where the field value had already been hashed by the client code and indexed in hashed form. However, this option disappeared in 2.x, when the mapper-murmur3 plugin was introduced, as hashing was deemed cheap.

What you need to know about Murmur3 is that it is a full-fledged field type called murmur3, normally used as a sub-field that automatically hashes the value of its parent field into a numeric hash (not a string-based one, like a UUID, SHA256 or MD5). Also, it usually only makes sense to use murmur3 hash fields on high-cardinality keyword fields, not on numeric fields or low-cardinality keyword fields, where the gains would be negligible while your disk usage would increase.
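For reference, here is a minimal sketch of what such a sub-field mapping and the matching aggregation could look like, written as Python dictionaries that can be serialized and sent as request bodies. The field name my_field, the sub-field name hash and the helper function names are placeholders chosen for illustration, and the mapper-murmur3 plugin must be installed for the murmur3 type to be accepted.

```python
import json


def murmur3_mapping(field: str = "my_field") -> dict:
    """Index body with a keyword field and a murmur3 sub-field named "hash".

    The field and sub-field names are illustrative placeholders.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "keyword",
                    # The sub-field stores a hash of the parent value at index time.
                    "fields": {"hash": {"type": "murmur3"}},
                }
            }
        }
    }


def cardinality_query(field: str = "my_field") -> dict:
    """Search body that runs the cardinality aggregation on the hash sub-field."""
    return {
        "size": 0,
        "aggs": {"count": {"cardinality": {"field": f"{field}.hash"}}},
    }


if __name__ == "__main__":
    print(json.dumps(murmur3_mapping(), indent=2))
    print(json.dumps(cardinality_query(), indent=2))
```

The first body would go into the index creation request and the second into a _search request against that index.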

So if you decide to pre-compute your hashes using your own hashing function you have two options:

- You can use a hashing function that yields a numeric hash value, similar to what Murmur3 does.
- You can use a hashing function that yields a string hash value. If you want to use that value directly without rehashing it, you can configure a specific execution_hint called direct (available since 8.4), in which case it would behave in exactly the same way as for numeric hashes, such as Murmur3.
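As a rough illustration of the first option, the sketch below pre-computes a numeric hash on the client before indexing. It uses BLAKE2b from the Python standard library rather than Murmur3, truncated to 8 bytes so that the result fits a signed 64-bit long field. The function names, the user_id source field and the hash_field target field are assumptions made for this example, and the sketch only shows how to produce the value; it does not by itself establish how Elasticsearch treats the field at aggregation time.

```python
import hashlib
import struct


def numeric_hash(value: str) -> int:
    """Return a signed 64-bit integer hash of `value`.

    BLAKE2b is a stand-in here (assumption); any well-distributed
    64-bit hash would serve the same purpose.
    """
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    # ">q" interprets the 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest)[0]


def with_hash(doc: dict, source_field: str, hash_field: str = "hash_field") -> dict:
    """Return a copy of `doc` with the hash of `source_field` added."""
    enriched = dict(doc)
    enriched[hash_field] = numeric_hash(doc[source_field])
    return enriched


if __name__ == "__main__":
    # Hypothetical document; index the enriched copy instead of the original.
    print(with_hash({"user_id": "a3f1c9e0-7b2d-4c55-9e10-1f2a3b4c5d6e"}, "user_id"))
```

The enriched document can then be indexed as usual, with hash_field mapped as a long.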

Sample aggregation query on a string-based hash field with direct execution hint:

GET test/_search
{
  "size": 0,
  "aggs": {
    "count": {
      "cardinality": {
        "field": "hash_field",
        "execution_hint": "direct"
      }
    }
  }
}

It's also interesting to see how the execution mode gets picked, directly in the source code of the CardinalityAggregator class.

I think that should answer all of your questions above.