I have a text file with the following content:
war force
force war
I split the text and store the words in TextWord:
TextWord[0]: war
TextWord[1]: force
TextWord[2]: force
TextWord[3]: war
I want to find only "war force", but my search also finds "force war". I want the search to take into account 2 rules:
I tried these queries:
Query query = parser.parse(" \"war force\"~0x ");
Query query = parser.parse(" \"war force\"~0 ");
Query query = parser.parse("war AND force");
Query query = parser.parse("war force");
But none of these queries gives the desired result. How can I do this?
My code:
Analyzer customAnalyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .build();
QueryParser parser = new QueryParser("tags", customAnalyzer);
Query query = parser.parse("\"war force\" AND NOT \"force war\"");
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(query, 10);
System.out.println(" ");
FastVectorHighlighter highlighter = new FastVectorHighlighter();
FieldQuery fieldQuery = highlighter.getFieldQuery(query);
FieldTermStack stack = new FieldTermStack(reader, 0, "tags", fieldQuery);
TermInfo myTermInfo = stack.pop();
while (myTermInfo != null) {
    System.out.println("word[" + myTermInfo.getPosition() + "]: " + myTermInfo.getText());
    myTermInfo = stack.pop();
}
My output:
word[0]: war
word[1]: force
word[4]: force
word[5]: war
The result I need:
word[0]: war
word[1]: force
I read the documentation. If a request such as "Word1 Word2" has no operator between the words, the OR operator is applied by default. This means that the request "war force" is equivalent to the request "force war", so both are found: 1) "war force"; 2) "force war". I don't know how to get only "war force" as a result. Am I missing something?
And if I use the Highlighter, I get this result:
?<b>war</b> <b>force</b> bookcase bookcase1
force war
My code with highlighter:
Analyzer customAnalyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .build();
//... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("tags", customAnalyzer);
Query query = parser.parse(" \"war force\"~0 ");
//Query query = parser.parse("*Case");
//Query query = new PrefixQuery(new Term("tags", "book")); // Search for words that start with the string "book", e.g. "bookcase"
TopDocs hits = searcher.search(query, 10);
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<b>", "</b>");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.scoreDocs.length; i++) {
    int id = hits.scoreDocs[i].doc;
    Document doc = searcher.doc(id);
    String text = doc.get("tags");
    TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "tags", customAnalyzer);
    TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, true, 100); //highlighter.getBestFragments(tokenStream, text, 3, "...");
    for (int j = 0; j < frag.length; j++) {
        if ((frag[j] != null) && (frag[j].getScore() > 0)) {
            System.out.println((frag[j].toString()));
        }
    }
    System.out.println("finish test");
}
But if I use the Highlighter, I don't get the position of the found word.
To exclude a term or phrase, you can use the - operator (the "prohibit" operator):
"war force" -"force war"
So, in Java, this would be:
Query query = parser.parse("\"war force\" -\"force war\"");
You can also use AND NOT:
"war force" AND NOT "force war"
You can see more details in the classic query parser syntax documentation.
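To make the difference between matching individual terms and matching an ordered phrase concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class and method names (PhraseVsTerms, termHits, phraseHits) are made up for illustration, and the matching is only a rough model of what an unquoted OR query and a quoted phrase with slop 0 require.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: a rough model of term vs. phrase matching.
class PhraseVsTerms {

    // Indexes of every token equal to any of the given terms, in any order.
    // Roughly what an unquoted query such as: war force (default OR) cares about.
    static List<Integer> termHits(List<String> tokens, List<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (terms.contains(tokens.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Start indexes where the terms occur adjacent and in the given order.
    // Roughly what the quoted phrase "war force" with slop 0 requires.
    static List<Integer> phraseHits(List<String> tokens, List<String> phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> textWord = List.of("war", "force", "force", "war");
        List<String> query = List.of("war", "force");
        System.out.println(termHits(textWord, query));   // [0, 1, 2, 3]
        System.out.println(phraseHits(textWord, query)); // [0]
    }
}
```

With the four words from the question, term matching hits every word, while the ordered phrase only starts at index 0.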
Update
The question has changed a lot since you first asked it!
Now there are 2 new problems:
1. Your query appears to be retrieving documents that it should not retrieve.
2. You cannot get the positions of matched terms.
Problem 1
I cannot recreate this problem. Let's assume I have 2 documents in my index:
Doc 1: State WEAPONRY war force word1 And force war Book WEAPONRY
Doc 2: State WEAPONRY war force 123 War WORD1 Force And war Book WEAPONRY
When I use the following query:
"war force" AND NOT "force war"
I find Doc 2, but not Doc 1 - which is correct.
I don't know why you are seeing incorrect/unexpected results. I guess it may be because your index contains unexpected data or may be using an unexpected indexing approach. There is nothing in the question which helps to explain this.
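One thing worth keeping in mind is that the prohibit clause works on whole documents: a document that contains the prohibited phrase anywhere is dropped entirely, even if it also contains the wanted phrase. The following plain-Java sketch models that rule for the two documents above. It does not use Lucene; the names (ProhibitSketch, containsPhrase, matches) are hypothetical, and it assumes whitespace tokenization with case-sensitive comparison, which may differ from your analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of: "war force" AND NOT "force war", evaluated per document.
class ProhibitSketch {

    // Assumption: split on whitespace, keep case as-is.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // True if the phrase terms occur adjacent and in order somewhere in the tokens.
    static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        return Collections.indexOfSubList(tokens, phrase) >= 0;
    }

    // A document matches only if it has the wanted phrase and lacks the prohibited one.
    static boolean matches(String doc) {
        List<String> tokens = tokenize(doc);
        return containsPhrase(tokens, List.of("war", "force"))
                && !containsPhrase(tokens, List.of("force", "war"));
    }

    public static void main(String[] args) {
        String doc1 = "State WEAPONRY war force word1 And force war Book WEAPONRY";
        String doc2 = "State WEAPONRY war force 123 War WORD1 Force And war Book WEAPONRY";
        System.out.println("Doc 1 matches: " + matches(doc1)); // false
        System.out.println("Doc 2 matches: " + matches(doc2)); // true
    }
}
```

Doc 1 is rejected because it contains "force war" later in the text, which is the same outcome as the Lucene query above.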
Problem 2
Your question now contains two examples of using highlighters, one with FastVectorHighlighter and one with Highlighter. However, neither code fragment reports the positions of the matched tokens. To do that, you can use the approach shown in this answer:
Lucene how can I get position of found query?
When I use that approach, and use the same data and query as shown above, I get the following results:
Found term: war
Position: 3
Found term: force
Position: 4
And, again, this is correct: The matched terms are the 3rd and 4th words in the found document.
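As a sanity check on those numbers, the word positions can be reproduced with a small plain-Java sketch that does not touch Lucene. The names (PhrasePositions, positions) are hypothetical. It assumes whitespace tokenization and counts words starting from 1 so that the numbers line up with the output above; whether your own code numbers positions the same way is something to verify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: word numbers (counted from 1) of the first phrase occurrence.
class PhrasePositions {

    static List<Integer> positions(String doc, String... phrase) {
        // Assumption: split on whitespace, keep case as-is.
        List<String> tokens = Arrays.asList(doc.trim().split("\\s+"));
        int start = Collections.indexOfSubList(tokens, Arrays.asList(phrase));
        List<Integer> result = new ArrayList<>();
        if (start < 0) {
            return result; // phrase not found
        }
        for (int k = 0; k < phrase.length; k++) {
            result.add(start + k + 1); // +1 converts the 0-based index to a word number
        }
        return result;
    }

    public static void main(String[] args) {
        String doc2 = "State WEAPONRY war force 123 War WORD1 Force And war Book WEAPONRY";
        System.out.println(positions(doc2, "war", "force")); // [3, 4]
    }
}
```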