Tags: elasticsearch, elasticsearch-java-api

Cardinality aggregation vs Terms aggregation with calculating bucket size


I'm using Elasticsearch 2.4 and would like to get distinct counts for various entities in my data. I've experimented with a lot of queries using two ways of calculating distinct counts: one is a cardinality aggregation, and the other is a terms aggregation followed by counting the number of buckets returned. With the former approach I've seen the counts be inaccurate, but it is faster and relatively simple. My data is large and will grow over time, so I don't know how the cardinality aggregation will perform as the data grows, or whether it will become more or less accurate. I wanted advice from people who have faced this question before, and which approach they chose.


Solution

  • The cardinality aggregation takes an additional parameter, precision_threshold.

    The precision_threshold option allows trading memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.

    • configurable precision, which decides how to trade memory for accuracy,
    • excellent accuracy on low-cardinality sets,
    • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
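    For example, a cardinality aggregation with an explicit precision_threshold could be written as the following search request body (the aggregation name distinct_users and the field user_id are placeholders, not from the question):

    ```json
    {
      "size": 0,
      "aggs": {
        "distinct_users": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 40000
          }
        }
      }
    }
    ```

    With "size": 0 no documents are returned, only the aggregation result, whose single value field holds the (approximate) distinct count.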

    In short, the cardinality aggregation can give you near-exact counts up to a cardinality of 40000, after which it returns an approximate count. The higher the precision_threshold, the higher the memory cost and the higher the accuracy. For very high cardinalities, it can only give you an approximate count.
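    For comparison, the terms-aggregation approach from the question would look roughly like this (again with placeholder names):

    ```json
    {
      "size": 0,
      "aggs": {
        "distinct_users": {
          "terms": {
            "field": "user_id",
            "size": 0
          }
        }
      }
    }
    ```

    In Elasticsearch 2.x, "size": 0 inside a terms aggregation is interpreted as "return all buckets" (this was removed in later versions), and the distinct count is then simply the number of buckets in the response. That count is exact, but the cost grows with the number of unique values, unlike the cardinality aggregation, whose memory usage is fixed by the configured precision.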