I am using OpenSearch to query documents in my database. Currently I am running the search below with default_operator=AND. Terms 1, 4 and 5 are quoted because each stands for a two-word phrase that I'm omitting here, e.g. "foo bar":
"term 1" term2 term3 "term 4" OR "term 5"
but when I look at the results, there are documents that contain only "term 1" term2 term3. This changes if I add parentheses; the following search returns what I want:
("term 1" term2 term3 "term 4") OR ("term 5")
Does it make sense for these two queries to return different results?
I also tried moving "term 4" to a different position:
"term 1" "term 4" term2 term3 OR "term 5"
and the results also differ from those of the first query, which doesn't make sense to me.
Here is an example of the almost complete query:
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field.analyzed"
                  ],
                  "default_operator": "AND",
                  "boost": 0.1
                }
              },
              {
                "query_string": {
                  "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
                  "fields": [
                    "my_field_2",
                    "my_field_3"
                  ],
                  "boost": 0.5
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "my_field"
          }
        }
      ],
It is worth noting that the boolean operators DO NOT follow the usual precedence rules (another example here and some more thoughts here).
If you're a JavaCC aficionado, you can also check the compiler definition for Lucene's query string parser. You'll see that the query is parsed sequentially, i.e. there is no operator precedence as you would expect, unless you group clauses explicitly with parentheses.
The main takeaway from the last link is that instead of thinking in terms of boolean operations, you need to think in terms of OPTIONAL, REQUIRED (i.e. +), and PROHIBITED (i.e. -).
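The sequential behaviour described above can be sketched as a toy model in Python. This is a simplified reading of how the classic Lucene parser assigns OPTIONAL/REQUIRED/PROHIBITED flags clause by clause, not the parser itself. The names parse_flat and render are hypothetical, parentheses, NOT and field prefixes are not handled, and every token is treated as its own clause (OpenSearch may instead group adjacent unquoted terms such as term2 term3 into a single clause).

```python
import re

MUST, SHOULD, MUST_NOT = "+", "", "-"


def parse_flat(query, default_operator="AND"):
    """Assign an occurrence flag to each clause of a flat query (no parentheses)."""
    clauses = []
    conj = None  # explicit AND/OR seen just before the next clause
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token in ("AND", "OR"):
            conj = token
            continue
        mod = None
        if token[0] in "+-" and len(token) > 1:
            mod, token = token[0], token[1:]
        # An explicit conjunction first rewrites the PREVIOUS clause,
        # unless that clause is prohibited.
        if clauses and clauses[-1][0] != MUST_NOT:
            if conj == "AND":
                clauses[-1] = (MUST, clauses[-1][1])
            elif conj == "OR" and default_operator == "AND":
                clauses[-1] = (SHOULD, clauses[-1][1])
        # Then the current clause gets its own flag.
        if mod == "-":
            occur = MUST_NOT
        elif default_operator == "AND":
            occur = SHOULD if conj == "OR" else MUST
        else:
            occur = MUST if (mod == "+" or conj == "AND") else SHOULD
        clauses.append((occur, token))
        conj = None
    return clauses


def render(clauses):
    """Print clauses in the +required / optional / -prohibited notation."""
    return " ".join(f"{occur}{term}" for occur, term in clauses)


print(render(parse_flat('"term 1" term2 term3 "term 4" OR "term 5"')))
# +"term 1" +term2 +term3 "term 4" "term 5"
print(render(parse_flat('"term 1" "term 4" term2 term3 OR "term 5"')))
# +"term 1" +"term 4" +term2 term3 "term 5"
```

In this model the OR only touches the clause immediately before it and the clause immediately after it, which is why moving "term 4" changes which clause ends up optional. With parentheses, each group becomes a single clause at the top level, so the OR can only make the groups optional relative to each other and leaves the flags inside each group alone.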
Using the Validate API, you can see what's executed on the Lucene side. For instance, the first query below
{
  "query_string": {
    "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
    "fields": [
      "my_field.analyzed"
    ],
    "default_operator": "AND",
    "boost": 0.1
  }
},
is executed as
(
+my_field.analyzed:term 1
+my_field.analyzed:term2 term3
my_field.analyzed:term 4
my_field.analyzed:term 5
)^0.1
So,
- term 1 is required
- term2 term3 (both are concatenated together) is required
- term 4 and term 5 are optional
Regarding the second query,
{
  "query_string": {
    "query": "\"term 1\" term2 term3 \"term 4\" OR \"term 5\"",
    "fields": [
      "my_field_2",
      "my_field_3"
    ],
    "boost": 0.5
  }
}
it is executed as
+(
+(my_field_2:term 1 | my_field_3:term 1)
+(my_field_2:term2 term3 | my_field_3:term2 term3)
(my_field_2:term 4 | my_field_3:term 4)
(my_field_2:term 5 | my_field_3:term 5)
)^0.1
So:
- term 1 must be present in either my_field_2 or my_field_3
- term2 term3 must be present in either my_field_2 or my_field_3
- term 4 can be present in either my_field_2 or my_field_3
- term 5 can be present in either my_field_2 or my_field_3
The group as a whole is also required (note the + at the very beginning).
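To reproduce explanations like the ones above for your own variants, you can send each query_string to the Validate API with explain=true. The following is a minimal sketch that only builds the request path and body without sending anything. build_validate_request and the index name my_index are hypothetical, and the parenthesized query is the one from the question.

```python
import json


def build_validate_request(index, query, fields, default_operator="OR"):
    """Return (path, JSON body) for a Validate API call with explain enabled."""
    path = f"/{index}/_validate/query?explain=true"
    body = {
        "query": {
            "query_string": {
                "query": query,
                "fields": fields,
                "default_operator": default_operator,
            }
        }
    }
    return path, json.dumps(body, indent=2)


# my_index is a placeholder; replace it with the name of your own index.
path, body = build_validate_request(
    "my_index",
    '("term 1" term2 term3 "term 4") OR ("term 5")',
    ["my_field.analyzed"],
    default_operator="AND",
)
print("GET", path)
print(body)
```

Comparing the explanation returned for the parenthesized query with the one for the flat query should make it visible which clauses end up required and which end up optional.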