Elasticsearch: Filter documents when one field is the same but another field is different


Assume I have the following documents:

{"id": "a", "stage": 1}, 
{"id": "a", "stage": 2},

{"id": "b", "stage": 1},  

How can I exclude id = a from the results, given that it has 2 documents with different stage values? In other words, I only want to retrieve documents that have a single stage entry. If it helps, the query can contain stage = 1.


Solution

  • Documents are independent of one another, so you cannot say "I only want to retrieve documents that have a single stage entry": every document, taken on its own, has exactly one stage entry.

    In order to achieve what you expect, you would first need to "aggregate" your documents into a new document set whose pivot is the id field.

    The following query aggregates your documents on the id field. It returns one bucket per id, each with the number of documents (that is, stages) it contains, ordered so that the ids with a single stage come first.

    GET index/_search
    {
      "size": 0,
      "aggs": {
        "ids": {
          "terms": {
            "field": "id",
            "size": 100,
            "order": {
              "_count": "asc"
            }
          }
        }
      }
    }
    

    If fewer than 10000 distinct ids have a single stage, you can raise the size parameter of the terms aggregation to 10000 and pick out the buckets with doc_count = 1 at a glance. If there are more, you need the composite aggregation instead, which lets you paginate through the buckets until you have collected all the ones you need.
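
    The pagination mentioned above might look like the following sketch. The page size of 1000 is an arbitrary choice, and the source name "id" is just a label picked for this example. Composite buckets come back in key order rather than sorted by doc_count, so the client has to keep the buckets with doc_count = 1 itself.

    # First page (sketch; page size and source name are assumptions)
    GET index/_search
    {
      "size": 0,
      "aggs": {
        "ids": {
          "composite": {
            "size": 1000,
            "sources": [
              { "id": { "terms": { "field": "id" } } }
            ]
          }
        }
      }
    }

    # Next page: pass the after_key of the previous response as "after"
    # (the value "a" is only an illustration)
    GET index/_search
    {
      "size": 0,
      "aggs": {
        "ids": {
          "composite": {
            "size": 1000,
            "sources": [
              { "id": { "terms": { "field": "id" } } }
            ],
            "after": { "id": "a" }
          }
        }
      }
    }

    Repeat the second request with each new after_key until a page comes back with no buckets.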

    To sum up, neither the terms nor the composite aggregation will, on its own, return only the buckets with doc_count = 1. With terms, you can sort the buckets by document count in ascending order and iterate until you reach the first bucket with doc_count = 2. With composite, buckets are ordered by key, so you page through all of them and keep the ones with doc_count = 1.
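
    One refinement that may be worth trying on top of the terms aggregation is a bucket_selector pipeline sub-aggregation, which prunes buckets after terms has built them. The sketch below reuses the query from above; the name single_stage_only is made up for this example. The selector only sees the buckets that terms already selected, so the ascending order and the size limit still apply.

    # Sketch: drop every id bucket whose document count is not exactly 1
    GET index/_search
    {
      "size": 0,
      "aggs": {
        "ids": {
          "terms": {
            "field": "id",
            "size": 10000,
            "order": {
              "_count": "asc"
            }
          },
          "aggs": {
            "single_stage_only": {
              "bucket_selector": {
                "buckets_path": { "count": "_count" },
                "script": "params.count == 1"
              }
            }
          }
        }
      }
    }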

    Another way is to use a Transform, which creates a new index from this kind of grouping on id. You can then query that aggregated index directly for only the ids whose document count is 1.
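
    A minimal sketch of such a transform follows, assuming a recent version where the _transform endpoints are available. The transform id ids_stage_count, the destination index of the same name, and the stage_count field are all hypothetical names chosen for this example. A value_count on stage stands in for the per-id document count.

    # Create a pivot transform that writes one document per id
    PUT _transform/ids_stage_count
    {
      "source": { "index": "index" },
      "dest": { "index": "ids_stage_count" },
      "pivot": {
        "group_by": {
          "id": { "terms": { "field": "id" } }
        },
        "aggregations": {
          "stage_count": { "value_count": { "field": "stage" } }
        }
      }
    }

    # Run it
    POST _transform/ids_stage_count/_start

    # Query the aggregated index for ids with a single stage
    GET ids_stage_count/_search
    {
      "query": {
        "term": { "stage_count": 1 }
      }
    }

    If distinct stage values matter rather than the raw number of documents, a cardinality aggregation on stage could replace value_count.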