elasticsearch opensearch

OpenSearch returns different results when using parentheses


I am using OpenSearch to query documents in my database, and I am currently running the search below. I'm using default_operator=AND, and the quoted terms (terms 1, 4 and 5) are two-word phrases that I'm omitting, e.g. "foo bar":

"term 1" term2 term3 "term 4" OR "term 5"

But when I look at the results, there are documents that contain only "term 1" term2 term3. This changes if I add parentheses; the following search returns what I want: ("term 1" term2 term3 "term 4") OR ("term 5")

Does it make sense for these two queries to return different results?

I also tried moving "term 4" to a different position:

"term 1" "term 4" term2 term3 OR "term 5"

and the results also differ from those of the first query, which doesn't make sense to me.

This is an example of an almost full query:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field.analyzed"
                  ],
                  "default_operator": "AND",
                  "boost": 0.1
                }
              },
              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field_2",
                    "my_field_3"
                  ],
                  "boost": 0.5
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "my_field"
          }
        }
      ],

Solution

  • It is worth noting that the boolean operators DO NOT follow the usual precedence rules (another example here and some more thoughts here).

    If you're a JavaCC aficionado, you can also check the compiler definition for Lucene's query string parser. You'll see that the query is parsed sequentially, i.e. there's no operator precedence as you would expect, except when you group clauses explicitly with parentheses.

    The main takeaway from the last link is that instead of thinking in terms of boolean operations, you need to think in terms of OPTIONAL, REQUIRED (i.e. +), and PROHIBITED (i.e. -) clauses.
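To make the "sequential, no precedence" point concrete, here is a minimal Python sketch of how a left-to-right parser could assign REQUIRED/OPTIONAL/PROHIBITED to each clause. It is a simplified model based on my reading of Lucene's classic query parser, not the parser itself: the function names are made up for illustration, parentheses and NOT are not handled, and no text analysis is applied to the terms.

```python
import re
from typing import List, Tuple

MUST, SHOULD, MUST_NOT = "+", "", "-"


def assign_occurs(query: str, default_operator: str = "OR") -> List[Tuple[str, str]]:
    """Walk the query left to right and return (occur, term) pairs.

    Hypothetical helper: handles bare terms, "quoted phrases",
    the AND / OR keywords and the + / - prefixes only.
    """
    tokens = re.findall(r'[+-]?"[^"]*"|\S+', query)
    clauses: List[Tuple[str, str]] = []
    conj = None  # conjunction seen just before the current term

    for tok in tokens:
        if tok in ("AND", "OR"):
            conj = tok
            continue

        mod = None
        if tok[0] in "+-" and len(tok) > 1:
            mod, tok = tok[0], tok[1:]

        if clauses and clauses[-1][0] != MUST_NOT:
            if conj == "AND":
                # an explicit AND makes the previous clause required
                clauses[-1] = (MUST, clauses[-1][1])
            elif conj == "OR" and default_operator == "AND":
                # an explicit OR makes only the previous clause optional
                clauses[-1] = (SHOULD, clauses[-1][1])

        if mod == "-":
            occur = MUST_NOT
        elif default_operator == "OR":
            occur = MUST if (mod == "+" or conj == "AND") else SHOULD
        else:
            occur = SHOULD if conj == "OR" else MUST

        clauses.append((occur, tok))
        conj = None

    return clauses


def render(clauses: List[Tuple[str, str]]) -> str:
    """Format clauses in the +term / term / -term notation."""
    return " ".join(f"{occur}{term}" for occur, term in clauses)


print(render(assign_occurs('"term 1" term2 term3 "term 4" OR "term 5"', "AND")))
# +"term 1" +term2 +term3 "term 4" "term 5"

print(render(assign_occurs('"term 1" "term 4" term2 term3 OR "term 5"', "AND")))
# +"term 1" +"term 4" +term2 term3 "term 5"
```

Under this model, OR only touches the clause immediately to its left and the one to its right, which would explain why moving "term 4" changes the results: whichever clause sits next to OR becomes optional, and everything else stays required.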

    Using the Validate API, you can see what's executed on the Lucene side. For instance, the first query below

              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field.analyzed"
                  ],
                  "default_operator": "AND",
                  "boost": 0.1
                }
              },
    

    is executed as

    (
      +my_field.analyzed:term 1 
      +my_field.analyzed:term2 term3 
      my_field.analyzed:term 4 
      my_field.analyzed:term 5
    )^0.1
    

    So,

    • term 1 is required
    • term2 term3 (both are concatenated together) is required
    • term 4 and term 5 are optional
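The effect of that clause list on matching can be sketched as follows. This is a toy model, assuming the default bool-query behaviour where optional clauses only influence scoring once at least one required clause exists; `matches` and the document sets are hypothetical, and each "term" is treated as an opaque string rather than analyzed text.

```python
from typing import Iterable, Set


def matches(doc_terms: Set[str], required: Iterable[str], optional: Iterable[str]) -> bool:
    """Toy matcher for a clause list made of required and optional terms."""
    required = list(required)
    if required:
        # with required clauses present, optional ones affect scoring only
        return all(term in doc_terms for term in required)
    # with no required clauses, at least one optional clause has to match
    return any(term in doc_terms for term in optional)


required = ["term 1", "term2 term3"]
optional = ["term 4", "term 5"]

# a document with only the required terms still matches
print(matches({"term 1", "term2 term3"}, required, optional))

# a document containing only "term 5" does not match
print(matches({"term 5"}, required, optional))
```

This is why documents containing just "term 1" term2 term3 show up in the results, and why the OR does not behave like a top-level alternative unless the two sides are grouped with parentheses.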

    Regarding the second query,

              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field_2",
                    "my_field_3"
                  ],
                  "boost": 0.5
                }
              }
    

    it is executed as

    +(
      +(my_field_2:term 1 | my_field_3:term 1)
      +(my_field_2:term2 term3 | my_field_3:term2 term3) 
      (my_field_2:term 4 | my_field_3:term 4) 
      (my_field_2:term 5 | my_field_3:term 5)
    )^0.1
    

    So:

    • term 1 must be present in either my_field_2 or my_field_3
    • term2 term3 must be present in either my_field_2 or my_field_3
    • term 4 can be present in either my_field_2 or my_field_3
    • term 5 can be present in either my_field_2 or my_field_3
    • at least one of the above must match (i.e. the initial + at the very beginning)
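To inspect how a given query string is rewritten, the Validate API can be called with `explain=true`. Below is a minimal Python sketch that only builds the HTTP request, here for the parenthesized variant from the question. The host `http://localhost:9200`, the index name `my_index`, and the helper name are assumptions for illustration; the request is not sent unless you uncomment the last lines.

```python
import json
import urllib.request


def build_validate_request(base_url: str, index: str, query_string: str,
                           fields: list, default_operator: str = "AND") -> urllib.request.Request:
    """Build a request for <index>/_validate/query?explain=true (hypothetical helper)."""
    body = {
        "query": {
            "query_string": {
                "query": query_string,
                "fields": fields,
                "default_operator": default_operator,
            }
        }
    }
    url = f"{base_url.rstrip('/')}/{index}/_validate/query?explain=true"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# explicit grouping, as in the query that returned the expected results
req = build_validate_request(
    "http://localhost:9200",  # assumed local cluster
    "my_index",               # assumed index name
    '("term 1" term2 term3 "term 4") OR ("term 5")',
    ["my_field.analyzed"],
)
print(req.full_url)

# Uncomment to send the request against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```

Comparing the explanation returned for the grouped and ungrouped query strings should make the difference in required and optional clauses visible directly.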