I'm using Elasticsearch 2.4 and would like to get distinct counts for various entities in my data. I've played around with a lot of queries, which include two ways of calculating distinct counts: one is a cardinality aggregation, and the other is a terms aggregation followed by counting the number of buckets returned. With the former approach I've seen the counts be erroneous and inaccurate, but it is faster and relatively simple. My data is huge and will keep growing, so I don't know how the cardinality aggregation will perform over time, or whether it will become more or less accurate. I wanted advice from people who have faced this question before, and to hear which approach they chose.
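For concreteness, here is a minimal sketch of the two approaches I've been comparing, assuming an index named my_index and a field user_id (both placeholders for my actual data):

```
# Approach 1: cardinality aggregation (approximate distinct count)
curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": { "field": "user_id" }
    }
  }
}'

# Approach 2: terms aggregation; count the buckets in the response
# client-side ("size": 0 returns all buckets in ES 2.x, which can be
# very expensive for high-cardinality fields)
curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "terms": { "field": "user_id", "size": 0 }
    }
  }
}'
```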
The cardinality aggregation takes an additional parameter, precision_threshold. From the Elasticsearch docs:
The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.
In short, cardinality can give you close-to-exact counts up to a maximum cardinality of 40000, after which it returns an approximate count. The higher the precision_threshold, the higher the memory cost and the higher the accuracy. For very high cardinalities, it can only ever give you an approximate count.
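For example, here is how you would set the threshold; this is only a sketch, with my_index and user_id as placeholder names:

```
# Request near-exact counts up to the maximum threshold of 40000
curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}'
```

Keep in mind that the memory cost scales with precision_threshold, so only raise it as far as your accuracy requirements actually demand.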