elasticsearch text clojure lucene kibana

SpanNot Lucene Query being either too strict or too permissive


Given two documents with two fields each:

1. title: United Kingdom requested meeting of United Nations
   content: The United Nations will hear statements from the United Kingdom (...)

2. title: Airlines face scrutiny across nation
   content: United States airline United Airlines has faced increasing (...)

I'm after a Lucene query which will:

A) Match instances of the word "united", but NOT when it is followed by either "States" or "Kingdom", in either the title OR the content field.
B) Importantly, match both documents, even though they contain both a desired and an undesired phrase.

My first port of call has been spanNot(), which is meant to take two spanTerm queries in include, exclude order, followed by a dist integer and a boolean indicating whether the terms should be in order. For example:

spanNot(title:united, title:states, 1, true)

Given this, I've chained the necessary queries using a BooleanQuery so that the query is this:

(+spanNot(title:united, title:states, 1, true) +spanNot(title:united, title:kingdom, 1, true))
(+spanNot(content:united, content:states, 1, true) +spanNot(content:united, content:kingdom, 1, true))

As you can see, there are two groupings of queries above, which should read logically like this: "(Title must contain united BUT NOT united states, AND title must contain united BUT NOT united kingdom) OR (Content must contain united BUT NOT united states, AND content must contain united BUT NOT united kingdom)"

Conceptually this makes perfect sense to me. However, the results of my query, whether the initial spanNot or the longer chained BooleanQuery version, are incorrect: either the entire document is not matched, or every mention of the word "united" is matched. I'm having immense trouble working out why.

For some additional detail: I'm implementing the query builder using the Lucene Java library from Clojure, but testing the queries using Kibana's Lucene querying feature, over documents that absolutely should match. I'm on Lucene 7.7; an upgrade is probably on the cards, but I don't believe it would solve my problem.

Any insight would be tremendously appreciated.


Solution

  • This was fixed after much trawling through the Lucene documentation and debugging through the source code. Here is the right way to write this query in Lucene:

    spanNot(title:united, spanOr([spanNear([title:united, title:states], 0, true), spanNear([title:united, title:kingdom], 0, true)]), 0, 0)
    spanNot(content:united, spanOr([spanNear([content:united, content:states], 0, true), spanNear([content:united, content:kingdom], 0, true)]), 0, 0)
    spanNot(summary:united, spanOr([spanNear([summary:united, summary:states], 0, true), spanNear([summary:united, summary:kingdom], 0, true)]), 0, 0)

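    In case that's difficult to read, it's three separate queries (one for each field). Each is a spanNot with a term query as the include and a spanOr as the exclude, and the spanOr is itself composed of two spanNear queries, one for each exclusion phrase. The trailing 0, 0 on each spanNot are its pre and post distances, so an occurrence of "united" is dropped only when an excluded phrase actually overlaps it.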

    The issue before was that there were too many combinations of exclusion terms and fields for any arrangement of SHOULD and MUST clauses to express. The right way to run this search was one thorough query per field.
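To make the difference concrete, here is a minimal sketch in plain Python that models span positions by hand. It does not call Lucene at all: every function name below (tokens, span_term, span_near_ordered, span_or, span_not, per_field, chained_must) is a hypothetical helper written for this illustration, the tokenizer is a crude stand-in for a real analyzer, and the chained version is only one plausible reading of the original attempt ("united" not directly followed by the excluded term). Under those assumptions it shows why two MUST-ed spanNot clauses can both be satisfied by different occurrences of "united", while a single spanNot with a combined exclude cannot.

```python
import re

def tokens(text):
    """Lowercase word tokens; a stand-in for an analyzer, not Lucene's."""
    return re.findall(r"\w+", text.lower())

def span_term(toks, term):
    """Half-open (start, end) spans for every occurrence of `term`."""
    return [(i, i + 1) for i, t in enumerate(toks) if t == term]

def span_near_ordered(toks, terms):
    """Spans where `terms` appear adjacent and in order (slop 0, in order)."""
    n = len(terms)
    return [(i, i + n) for i in range(len(toks) - n + 1) if toks[i:i + n] == terms]

def span_or(*span_lists):
    """Union of several span lists."""
    return sorted(set().union(*span_lists))

def span_not(include, exclude, pre=0, post=0):
    """Keep include spans that no exclude span overlaps, after widening each
    include span by `pre` positions before and `post` positions after."""
    def hit(inc, exc):
        return exc[1] > inc[0] - pre and exc[0] < inc[1] + post
    return [inc for inc in include if not any(hit(inc, exc) for exc in exclude)]

def per_field(text):
    """Model of the working query for one field's text."""
    toks = tokens(text)
    exclude = span_or(span_near_ordered(toks, ["united", "states"]),
                      span_near_ordered(toks, ["united", "kingdom"]))
    return span_not(span_term(toks, "united"), exclude, 0, 0)

def chained_must(text):
    """Model of the earlier attempt (an assumed reading): two spanNot clauses
    that must each match somewhere in the field, independently of each other."""
    toks = tokens(text)
    united = span_term(toks, "united")
    a = span_not(united, span_term(toks, "states"), 0, 1)
    b = span_not(united, span_term(toks, "kingdom"), 0, 1)
    if a and b:
        return sorted(set(a) | set(b))
    return []

title = "United Kingdom requested meeting of United Nations"
both_bad = "United States and United Kingdom"
print(per_field(title))        # [(5, 6)]
print(chained_must(title))     # [(0, 1), (5, 6)]
print(per_field(both_bad))     # []
print(chained_must(both_bad))  # [(0, 1), (3, 4)]
```

In this model the chained version reports the "United" in "United Kingdom" as a match (the "states" clause has no objection to it), and it accepts a text that contains only excluded phrases. Folding both exclusions into one spanOr means every occurrence of "united" is checked against all excluded phrases at once.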