Search code examples
elasticsearchelasticsearch-dslelasticsearch-queryelasticsearch-sql

Use query result as parameter for another query in Elasticsearch DSL


I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:

{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "request_id": {
            "query": {
              "match": {
                "processing.message": "OUT Followup Synthesis"
              }
            },
            "fields": [
              "request_id"
            ],
            "_source": false
          }
        }
      }
    }
  }
}

As you can see above I'm trying to search for sources that their request_id is not one of the request_idswith processing.message equals to OUT Followup Synthesis.

I'm getting an error with this query:

Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]

How can I achieve my goal using Elasticsearch DSL?


Solution

  • Original question extracted from the comments

    I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language:

    SELECT d FROM   data d
    WHERE  d.processing.message = 'IN Followup Sythesis'
           AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis'); 
    

    Answer: generally speaking, neither application-side joins nor subqueries are supported in Elasticsearch.

    So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.


    Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.

    Taking these 3 documents as examples:

    POST reqs/_doc
    {"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
    
    POST reqs/_doc
    {"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
    
    POST reqs/_doc
    {"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
    

    you could run

    POST reqs/_search
    {
      "size": 0,
      "query": {
        "match": {
          "processing.message": "IN Followup Sythesis"
        }
      },
      "aggs": {
        "subquery_mock": {
          "scripted_metric": {
            "params": {
              "disallowed_msg": "OUT Followup Synthesis"
            }, 
            "init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
            "map_script": """
              def req_id = params._source.request_id;
              def msg = params._source.processing.message;
              
              if (msg.contains(params.disallowed_msg)) {
                state.disallowed_request_ids.add(req_id);
                // won't need this particular doc so continue looping
                return;
              }
              
              if (state.by_request_ids.containsKey(req_id)) {
                // there may be multiple docs under the same ID
                // so concatenate them
                state.by_request_ids[req_id].add(params._source);
              } else {
                // initialize an appendable arraylist
                state.by_request_ids[req_id] = [params._source];
              }
            """,
            "combine_script": """
              state.by_request_ids.entrySet()
                .removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
              return state.by_request_ids
            """,
            "reduce_script": "return states"
          }
        }
      }
    }
    

    which'd return only the correct request:

    "aggregations" : {
      "subquery_mock" : {
        "value" : [
          {
            "xyz" : [
              {
                "processing" : { "message" : "IN Followup Sythesis" },
                "request_id" : "xyz"
              }
            ]
          }
        ]
      }
    }
    

    ⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".

    💡 I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.


    FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.