Search code examples
elasticsearchopensearch

How can I filter bucket aggregation results based on sub-bucket aggregation document count?


I need a query where the results will exclude any userIds if they have at least 1 document with the tag set to a value within an 'excluded' list i.e. TAG A or TAG B.

I have an index with data like below:

{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "YYYYYYY",
  "_score": 10.272416,
  "_source": {
    "id": "YYYYYYY",
    "userId": "User1",
    "tag": "TAG A"
  }
},
{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "ZZZZZZ",
  "_score": 10.272416,
  "_source": {
    "id": "ZZZZZZ",
    "userId": "User1",
    "tag": "TAG B"
  },
{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "ZZZZZZ",
  "_score": 10.272416,
  "_source": {
    "id": "ZZZZZZ",
    "userId": "User2",
    "tag": "TAG A"
},
{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "ZZZZZZ",
  "_score": 10.272416,
  "_source": {
    "id": "ZZZZZZ",
    "userId": "User2",
    "tag": "TAG D"
},
{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "ZZZZZZ",
  "_score": 10.272416,
  "_source": {
    "id": "ZZZZZZ",
    "userId": "User4",
    "tag": "TAG D"
}

For the input above, I would expect an output of:

{
  "_index": "tags-3",
  "_type": "_doc",
  "_id": "ZZZZZZ",
  "_source": {
    "userId": "User4"
  }

since User4 has no documents with the tag set to TAG A or TAG B.

User4 is the only other user with a document with the tag set to TAG D however since it has another document with TAG B, it is excluded.


Solution

  • One way to do this would be to:

    1. Aggregate (group) on the user IDs - this would give you all the user IDs
    2. Then, aggregate the documents for each user ID (nested aggregation) with a filter for the multiple (or single) tag values you want to exclude - this would give you the total sum of documents with the tag set to an excluded tag for each user ID
    3. Finally, perform a bucket selector aggregation, only including user IDs which have a count of 0 for any excluded documents; this would give you the users who don't have any documents with any excluded tag values

    This query should work, for an excluded tag list of A, B & C:

    {
      "aggs": {
        "user-ids": {
          "terms": {
            "field": "userId.keyword",
            "size": 10000
          },
          "aggs": {
            "excluded_tags_agg": {
              "filter": {
                "bool": {
                  "should": [
                    {
                      "match_phrase": {
                        "tag.keyword": "TAG A"
                      }
                    },
                    {
                      "match_phrase": {
                        "tag.keyword": "TAG B"
                      }
                    },
                    {
                      "match_phrase": {
                        "tag.keyword": "TAG C"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              }
            },
            "filter_userids_which_do_not_have_any_docs_with_excluded_tags": {
              "bucket_selector": {
                "buckets_path": {
                  "doc_count": "excluded_tags_agg > _count"
                },
                "script": "params.doc_count == 0"
              }
            }
          }
        }
      },
      "size": 0
    }