Given two documents with two fields each:
1. title: United Kingdom requested meeting of United Nations
content: The United Nations will hear statements from the United Kingdom (...)
2. title: Airlines face scrutiny across nation
content: United States airline United Airlines has faced increasing (...)
I'm after a Lucene query which will:
A) Match instances of the word "united", but NOT when followed by either "States" or "Kingdom", in either the title OR the content field.
B) Importantly, match both documents, even though each contains both a desired and an undesired phrase.
My first port of call has been spanNot(), which is meant to take two spanTerm queries in include, exclude order, followed by a dist integer and a boolean indicating whether the terms should be in order. E.g.:
spanNot(title:united, title:states, 1, true)
Given this, I've chained the necessary queries using a BooleanQuery so that the query is this:
(+spanNot(title:united, title:states, 1, true) +spanNot(title:united, title:kingdom, 1, true))
(+spanNot(content:united, content:states, 1, true) +spanNot(content:united, content:kingdom, 1, true))
As you can see, there are two groupings of queries above, which should read logically like this: "(Title must contain united BUT NOT united states, AND title must contain united BUT NOT united kingdom) OR (Content must contain united BUT NOT united states, AND content must contain united BUT NOT united kingdom)"
Conceptually this makes perfect sense to me. However, I'm finding that the results of my query (either the initial spanNot or the longer chained BooleanQuery version) are incorrect: either the entire document is not matched, or every mention of the word "united" is matched. I'm having immense trouble working out why.
For some additional detail: I'm implementing the query builder using the Lucene Java library in Clojure, but testing the queries using Kibana's Lucene querying feature, over documents that absolutely should match. I'm using Lucene v7.7; an upgrade is probably on the cards, but I do not believe it would solve my problem.
Any insight would be tremendously appreciated.
This was fixed after much trawling through the Lucene documentation and debugging the source code. Here is the right way to write this query in Lucene:
spanNot(title:united, spanOr([spanNear([title:united, title:states], 0, true), spanNear([title:united, title:kingdom], 0, true)]), 0, 0)
spanNot(content:united, spanOr([spanNear([content:united, content:states], 0, true), spanNear([content:united, content:kingdom], 0, true)]), 0, 0)
spanNot(summary:united, spanOr([spanNear([summary:united, summary:states], 0, true), spanNear([summary:united, summary:kingdom], 0, true)]), 0, 0)
In case that's difficult to read, it's three separate queries (one for each field), each made up of a spanNot with a term query as the include and a spanOr as the exclude, which itself is composed of two spanNear queries, one for each exclusion term.
The issue before was that there were too many combinations of exclusion terms and fields for any distribution of SHOULD and MUST. The right way to execute this search was one thorough query per field.
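To make the per-field logic concrete, here is a toy model in plain Java of what the working query decides for the two example documents. It does not use Lucene at all: the class name SpanNotSketch, its methods, and the regex tokenizer are all made up for illustration, and the tokenizer is only a rough stand-in for whatever analyzer the index really uses. The idea it sketches is that an occurrence of "united" is dropped only when the very next token is an excluded term, and a document matches if any field keeps at least one occurrence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: plain Java, no Lucene classes involved.
public class SpanNotSketch {

    // Crude stand-in for an analyzer: lowercase, then split on anything that is not a-z.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Token positions of `include` whose immediately following token is NOT excluded.
    // This mirrors a span for "united" being discarded when it overlaps a
    // "united states" or "united kingdom" span.
    static List<Integer> survivingPositions(String text, String include, Set<String> excludedNext) {
        List<String> tokens = tokenize(text);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(include)) {
                continue;
            }
            boolean excluded = i + 1 < tokens.size() && excludedNext.contains(tokens.get(i + 1));
            if (!excluded) {
                hits.add(i);
            }
        }
        return hits;
    }

    // One check per field, OR-ed together: a single surviving occurrence is enough.
    static boolean matchesAnyField(String include, Set<String> excludedNext, String... fields) {
        for (String field : fields) {
            if (!survivingPositions(field, include, excludedNext).isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("states", "kingdom"));

        String title1 = "United Kingdom requested meeting of United Nations";
        String content1 = "The United Nations will hear statements from the United Kingdom (...)";
        String title2 = "Airlines face scrutiny across nation";
        String content2 = "United States airline United Airlines has faced increasing (...)";

        System.out.println(survivingPositions(title1, "united", excluded));
        System.out.println(survivingPositions(content2, "united", excluded));
        System.out.println(matchesAnyField("united", excluded, title1, content1));
        System.out.println(matchesAnyField("united", excluded, title2, content2));
    }
}
```

In this model document 1 matches through "United Nations" in both fields, and document 2 matches through "United Airlines" in its content only, since its title never mentions "united". That is requirement B from the question: the unwanted "United Kingdom" and "United States" occurrences are skipped individually rather than disqualifying the whole document.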