Search code examples
javaelasticsearchnested

Elasticsearch - How to find out all the documents where a field value of a nested object is greater than X with the max date


I have a few documents for a company model. Each company has some financial reports. They are designed as nested object of Elasticsearch. For example:

[
  {
    "id": 1,
    "financials": [
      {
        "periodEndDate": "2022-06-30",
        "cash": 10000,
        "inventory": 24000
      },
      {
        "periodEndDate": "2022-12-31",
        "cash": 12000,
        "inventory": 19000
      }
    ]
  },
  {
    "id": 2,
    "financials": [
      {
        "periodEndDate": "2022-12-31",
        "cash": 200000,
        "inventory": 1590000
      },
      {
        "periodEndDate": "2023-03-31",
        "cash": 210000,
        "inventory": 1430000
      }
    ]
  }
]

I want to find out all the companies that have cash > 200000 in the most recent period end date report. The desired result is company 2 because the most recent report of it is the one with periodEndDate 2023-03-31, and the cash of the report is 210000.

How to write the query?

In addition, what is the query if I want to filter the reports with second to the most recent date?

I found this post: ElasticSearch - Get the documents where the value of a field is x and the max date of the type is lower then y date But it requires "negative" filters, which does not work for me because the financial report nested object has too many fields.


Solution

  • If the number of companies is small enough that you don't need to page through them, you can do it using combination of top_hits and bucket_selector aggregations. Basically, the top_hits will find the latest report and bucket_selector will remove all companies that don't have enough cash.

    PUT test
    {
      "mappings": {
        "properties": {
          "id": {
            "type": "long"
          },
          "financials": {
            "type": "nested",
            "properties": {
              "periodEndDate": {
                "type": "date"
              },
              "cash": {
                "type":"long"
              },
              "inventory": {
                "type": "long"
              }
            }
          }
        }
      }
    }
    
    POST test/_bulk?refresh
    {"index":{"_id":1}}
    {"id":1,"financials":[{"periodEndDate":"2022-06-30","cash":10000,"inventory":24000},{"periodEndDate":"2022-12-31","cash":12000,"inventory":19000}]}
    {"index":{"_id":2}}
    {"id":2,"financials":[{"periodEndDate":"2022-12-31","cash":200000,"inventory":1590000},{"periodEndDate":"2023-03-31","cash":210000,"inventory":1430000}]}
    
    POST test/_search
    {
      "size": 0,
      "aggs": {
        "companies": {
          "terms": {
            "field": "id"
          },
          "aggs": {
            "nested": {
              "nested": {
                "path": "financials"
              },
              "aggs": {
                "last_cash": {
                  "top_hits": {
                    "sort": [
                      {
                        "financials.periodEndDate": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "financials.cash",
                        "financials.periodEndDate"
                      ]
                    },
                    "size": 1
                  }
                }
              }
            },
            "having.top_cash": {
              "bucket_selector": {
                "buckets_path": {
                  "last_cash": "nested>last_cash[_source.cash]"
                },
                "script": "params.last_cash > 200000"
              }
            }
          }
        }
      }
    }
    

    This method has several limitations including bad scalability, limited number of companies it can handle, and not supporting the second oldest report. It's also not the best solution if you will also need to sort the company by relevance or page through them.

    To achieve all that functionality and to build a scalable solution I would suggest adding an additional processing during indexing time and indexing the latest report and the second to the latest report as dedicated fields in your main object instead of storing them as nested and trying to find the latest and the second latest during search time. So, basically your records should look like this:

      {
        "id": 1,
        "latest_finanicals": {
                "periodEndDate": "2022-12-31",
                "cash": 12000,
                "inventory": 19000
            }
        },
        "previous_finanicals": {
                "periodEndDate": "2022-06-30",
                "cash": 10000,
                "inventory": 24000
            }
        }
        "financials": [
          {
            "periodEndDate": "2022-06-30",
            "cash": 10000,
            "inventory": 24000
          },
          {
            "periodEndDate": "2022-12-31",
            "cash": 12000,
            "inventory": 19000
          }
        ]
      }
    

    With latest_finanicals and previous_finanicals having object type instead of nested type. Then all your queries will become very simple.