Composite Elasticsearch aggregation only buckets documents where both sources exist


I have an Elasticsearch query with an aggregation like this:

"aggs": {
    "my_buckets": {
        "composite": {
            "sources": [
                {
                    "abc.firstfield": {
                        "terms": { "field": "abc.firstfield" }
                    }
                },
                {
                    "abc.secondfield": {
                        "terms": { "field": "abc.secondfield" }
                    }
                }
            ],
            "size": 10000
        }
    }
}

The original intent was this: firstfield and secondfield hold the same kind of values, and I wanted to count ("bucket") the number of items that have, for instance, ABC in either firstfield or secondfield. That would not really be bucketing, though, because a document could land in two buckets (when firstfield and secondfield differ).

Setting that aside, here is my current problem. As far as I am aware, this aggregation simply buckets on the firstfield + secondfield combination, and that would be enough for me. The problem is that when, for instance, secondfield is missing, there is no bucket with firstfield: something, secondfield: empty. So many documents are left out of the bucketing.

The documentation says that all fields have to be defined: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html

What can I do to get similar results that also include buckets like this:

{
   "key": {
      "abc.firstfield": "abcd"
   },
   "doc_count": 123
}

The only possible approach I found is the following, but it requires firstfield to always be present:

"aggs": {
        "first": {
            "terms": {
                "size": 10000,
                "field": "abc.firstfield"
            },
            "aggs": {
                "second": {
                    "terms": {
                        "size": 10000,
                        "field": "abc.secondfield"
                    }
                }
            }
        }
    }

The result is also a bit messier to parse for my taste.
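
For comparison, the nested response can be flattened on the client with a few lines. This is only a minimal sketch: the function name `flatten_nested_terms` and the sample response are hypothetical, and it assumes the aggregation names "first" and "second" from the query above. It also shows the limitation mentioned: a document without secondfield is counted in the outer doc_count but produces no inner bucket, so it disappears from the flat rows.

```python
from typing import Any, Dict, List, Tuple


def flatten_nested_terms(aggs: Dict[str, Any]) -> List[Tuple[str, str, int]]:
    """Turn the nested "first" -> "second" terms response into flat rows."""
    rows = []
    for outer in aggs["first"]["buckets"]:
        for inner in outer["second"]["buckets"]:
            rows.append((outer["key"], inner["key"], inner["doc_count"]))
    return rows


# Hypothetical fragment shaped like the "aggregations" part of a response.
# Three documents have firstfield "abcd", but only two also have secondfield.
sample = {
    "first": {
        "buckets": [
            {
                "key": "abcd",
                "doc_count": 3,
                "second": {"buckets": [{"key": "xyz", "doc_count": 2}]},
            }
        ]
    }
}

rows = flatten_nested_terms(sample)
print(rows)
```

The third document never shows up in `rows`, which is the same gap the composite aggregation has by default.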


Solution

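  • In a composite aggregation you can set missing_bucket: true on each terms source, so documents with no value for that source are still bucketed (with a null key):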

    {
      "aggs": {
        "my_buckets": {
          "composite": {
            "sources": [
              {
                "field1": {
                  "terms": {
                    "field": "field1.keyword",
                    "missing_bucket": true
                  }
                }
              },
              {
                "field2": {
                  "terms": {
                    "field": "field2.keyword",
                    "missing_bucket": true
                  }
                }
              }
            ],
            "size": 10000
          }
        }
      }
    }
    

    Result:

    "buckets" : [
            {
              "key" : {
                "field1" : "abc",
                "field2" : "xyz"
              },
              "doc_count" : 1
            },
            {
              "key" : {
                "field1" : "def",
                "field2" : null
              },
              "doc_count" : 1
            }
          ]
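
With null keys present in the result, the original intent from the question (counting documents that have a value in either field) can be approximated on the client. The following is a rough sketch, not part of the answer: `count_either_field` is a hypothetical helper, the third sample bucket is invented for illustration, and the logic assumes both fields are single-valued, since multi-valued fields would produce one bucket per value combination and inflate the totals.

```python
from collections import Counter
from typing import Any, Dict, List


def count_either_field(buckets: List[Dict[str, Any]]) -> Counter:
    """Sum doc_count per value appearing in any source of a composite key."""
    totals: Counter = Counter()
    for bucket in buckets:
        # Distinct non-null values of this key. Using a set avoids counting
        # a bucket twice when both fields hold the same value.
        values = {v for v in bucket["key"].values() if v is not None}
        for value in values:
            totals[value] += bucket["doc_count"]
    return totals


# The first two buckets mirror the result above; the third is hypothetical.
buckets = [
    {"key": {"field1": "abc", "field2": "xyz"}, "doc_count": 1},
    {"key": {"field1": "def", "field2": None}, "doc_count": 1},
    {"key": {"field1": "abc", "field2": "abc"}, "doc_count": 2},
]

totals = count_either_field(buckets)
print(dict(totals))
```

A document whose two fields differ contributes to two totals, so the totals can add up to more than the number of documents, as noted in the question.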