Search code examples
elasticsearchelasticsearch-aggregation

ElasticSearch - Combine filters & Composite Query to get unique fields combinations


Well.. I am quite "newb" regarding ES so regarding aggregation... there is no words in the dictionary to describe my level regarding it :p

Today I am facing an issue where I am trying to create a query that should execute something similar to a SQL DISTINCT, but among filters. I have this document given (of course, an abstraction of the real situation):

{
  "id": "1",
  "createdAt": 1626783747,
  "updatedAt": 1626783747,
  "isAvailable": true,
  "kind": "document",
  "classification": {
    "id": 1,
    "name": "a_name_for_id_1"
  },
  "structure": {
    "material": "cartoon",
    "thickness": 5
  },
  "shared": true,
  "objective": "stackoverflow"
}

As all the data of the above document can vary, I however have some values that can be redundant, such as classification.id, kind, structure.material.

So, in order to fullfit my requirements, I would like to "group by" these 3 fields in order to have a unique combination of each. If we go deeper, with the following data, I should get the following possibilities:

[{
        "id": "1",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 1,
            "name": "a_name_for_id_1"
        },
        "structure": {
            "material": "cartoon",
            "thickness": 5
        },
        "shared": true,
        "objective": "stackoverflow"
    },
    {
        "id": "2",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 2,
            "name": "a_name_for_id_2"
        },
        "structure": {
            "material": "iron",
            "thickness": 3
        },
        "shared": true,
        "objective": "linkedin"
    },
    {
        "id": "3",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": false,
        "kind": "document",
        "classification": {
            "id": 2,
            "name": "a_name_for_id_2"
        },
        "structure": {
            "material": "paper",
            "thickness": 1
        },
        "shared": false,
        "objective": "tiktok"
    },
    {
        "id": "4",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "cartoon",
            "thickness": 5
        },
        "shared": false,
        "objective": "snapchat"
    },
    {
        "id": "5",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "paper",
            "thickness": 1
        },
        "shared": true,
        "objective": "twitter"
    },
    {
        "id": "6",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": false,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "iron",
            "thickness": 3
        },
        "shared": true,
        "objective": "facebook"
    }
]

based on the above, I should get the following results in the "buckets":

  • document 1 cartoon
  • document 2 iron
  • document 2 paper
  • document 3 cartoon
  • document 3 paper
  • document 3 iron

Of course, for the sake of this example (and to make it easier, I yet don't have any duplicates)

However, on top of that, I need some "pre-filters" as I only want:

  • Documents that are available isAvailable=true
  • Documents'structure's thickness should range between 2 and 4 included: 2 >= structure.thickness >= 4
  • Document's that are shared shared=true

I should so then get only the following combinations compared to the first set of results:

  • document 1 cartoon -> not a valid result, thickness > 4
  • document 2 iron
  • document 2 paper -> not a valid result, isAvailable != true
  • document 3 cartoon -> not a valid result, thickness > 4
  • document 3 cartoon -> not a valid result, thickness < 2
  • document 3 iron -> not a valid result, isAvailable != true

If you're still reading, well.. thanks! xD

So, as you can see, I need all the possible combination of this field regarding the static pattern kind <> classification_id <> structure_material that are matching the filters regarding isAvailable, thickness, shared.

Regarding the output, the hits doesn't matter to me as I don't need the documents but only the combination kind <> classification_id <> structure_material :)

Thanks for any help :)

Max


Solution

  • Thanks to a colleague, I could finally get it working as expected!

    QUERY

    GET index-latest/_search
    {
       "size": 0,
       "query": {
          "bool": {
             "filter": [
                {
                   "term": {
                      "isAvailable": true
                   }
                },
                {
                   "range": {
                      "structure.thickness": {
                         "gte": 2,
                         "lte": 4
                      }
                   }
                },
                {
                   "term": {
                      "shared": true
                   }
                }
             ]
          }
       },
       "aggs": {
          "my_agg_example": {
             "composite": {
                "size": 10,
                "sources": [
                   {
                      "kind": {
                         "terms": {
                            "field": "kind.keyword",
                            "order": "asc"
                         }
                      }
                   },
                   {
                      "classification_id": {
                         "terms": {
                            "field": "classification.id",
                            "order": "asc"
                         }
                      }
                   },
                   {
                      "structure_material": {
                         "terms": {
                            "field": "structure.material.keyword",
                            "order": "asc"
                         }
                      }
                   }
                ]
             }
          }
       }
    }
    

    The given result is then:

    {
       "took": 11,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "skipped": 0,
          "failed": 0
       },
       "hits": {
          "total": {
             "value": 1,
             "relation": "eq"
          },
          "max_score": null,
          "hits": []
       },
       "aggregations": {
          "my_agg_example": {
             "after_key": {
                "kind": "document",
                "classification_id": 2,
                "structure_material": "iron"
             },
             "buckets": [
                {
                   "key": {
                      "kind": "document",
                      "classification_id": 2,
                      "structure_material": "iron"
                   },
                   "doc_count": 1
                }
             ]
          }
       }
    }
    

    So, as we can see, we get the following bucket:

    {
        "key": {
            "kind": "document",
            "classification_id": 2,
            "structure_material": "iron"
        },
        "doc_count": 1
    }
    

    Note: Be careful regarding the type of your field.. putting .keyword on classification.id was resulting to no results in the buckets... .keyword should be use only on types such as string (as far as I understood, correct me if I am wrong)

    As expected, we have the following result (compared to the initial question):

    • document 2 iron

    Note: Be careful, the order of the elements within the aggs.<name>.composite.sources does play a role in the returned results.

    Thanks!