Search code examples
elasticsearchsearchelasticsearch-aggregationresthighlevelclientelastic-rest-client

Calculate exact count of distinct values for combination of 2 fields in Elasticsearch


I have around 40 million records in my elasticsearch index. I want to calculate count of distinct values for combination of 2 fields.

Example for given set of documents:

[
 {
  "JobId" : 2,
  "DesigId" : 12
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 }
]

For above example, I should get the count = 3 as only 3 distinct values exists : [(2,12),(2,4),(3,5)]

I tried using cardinality aggregation for this but that provides an approximate count. I want to calculate the exact count accurately.

Below is the query which I used using cardinality aggregation:

"aggs": {
        "counts": {
            "cardinality": {
                "script": "doc['JobId'].value + ',' + doc['DesigId'].value",
                "precision_threshold": 40000
            }
        }
    }

I also tried using composite aggregation on combination of 2 fields using after key and counting the overall size of buckets but that process is really time taking and my query is getting timed out.

Is there any optimal way to achieve it?


Solution

  • Scripting should be avoided as it affects performance. For your use case, there are 3 ways by which you can achieve your required results :

    1. Using Composite Aggregation (which you have already tried)
    2. Using Multi terms aggregation, but this is not memory efficient solution

    Search Query :

    {
        "size": 0,
        "aggs": {
            "jobId_and_DesigId": {
                "multi_terms": {
                    "terms": [
                        {
                            "field": "JobId"
                        },
                        {
                            "field": "DesigId"
                        }
                    ]
                }
            }
        }
    }
    

    Search Result:

    "aggregations": {
            "jobId_and_DesigId": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": [
                            2,
                            4
                        ],
                        "key_as_string": "2|4",
                        "doc_count": 2
                    },
                    {
                        "key": [
                            3,
                            5
                        ],
                        "key_as_string": "3|5",
                        "doc_count": 2
                    },
                    {
                        "key": [
                            2,
                            12
                        ],
                        "key_as_string": "2|12",
                        "doc_count": 1
                    }
                ]
            }
        }
    
    1. The combined field value (i.e., the combination of "JobId" and "DesigId") should be stored at the index time itself as this is the best method. This is possible by using a set processor.
    PUT /_ingest/pipeline/concat
    {
      "processors": [
        {
          "set": {
            "field": "combined_field",
            "value": "{{JobId}} {{DesigId}}"
          }
        }
      ]
    }
    

    Index API

    When indexing the documents, you need to add pipeline=concat query param, each time you index the documents. Suppose a index API will look like :

    POST _doc/1?pipeline=concat
    {
        "JobId": 2,
        "DesigId": 12
        
    }
    

    Search Query:

    {
        "size": 0,
        "aggs": {
            "jobId_and_DesigId": {
                "terms": {
                    "field":"combined_field.keyword"
                }
            }
        }
    }
    

    Search Result:

     "aggregations": {
            "jobId_and_DesigId": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "2 4",
                        "doc_count": 2
                    },
                    {
                        "key": "3 5",
                        "doc_count": 2
                    },
                    {
                        "key": "2 12",
                        "doc_count": 1
                    }
                ]
            }
        }