
Lucene search skips some results


I am trying to build an application that implements a search system over a Lucene index. The index is built, I can search it, and everything seems to work fine, except that when I search on a field value shared by many documents, the search returns only some of them. I ran the same search in Luke and it behaves the same way.

For example, my index has 2 fields:

Field A: A unique identifier.

Field B: A String.

First Example:

We have 5 documents:

Doc 1: FieldA:1; FieldB:hello world

Doc 2: FieldA:2; FieldB:hello world!

Doc 3: FieldA:3; FieldB:hello world

Doc 4: FieldA:4; FieldB:anything

Doc 5: FieldA:5; FieldB:hello world

When I run a search like "B: hello world", it should return documents 1, 3 and 5, but it only returns 1 and 3.

When I run a search like "A: 5", it returns document 5, and its field B value is "hello world".

Second Example: (one token)

Doc 6: FieldA:6; FieldB:token

Doc 7: FieldA:7; FieldB:token

Doc 8: FieldA:8; FieldB:TOKEN

Doc 9: FieldA:9; FieldB:token

When I search FieldB:"token" it only returns Doc 6 and Doc 9. The only way I can find Doc 7 is searching by its FieldA.

I am using WhitespaceAnalyzer and both Fields are NOT_ANALYZED.

IndexGenerator Main

...

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(200);

List<Work> works = getWorks(); //Retrieves the information from the DB

for (Work work: works) {

   Document luceneDocument = createLuceneDocument(work);
   writer.addDocument(luceneDocument);

}
writer.commit();

...

CreateLuceneDocument Method:

private static Document createLuceneDocument(Work work) {

 try {
   Document luceneDoc = new Document();

   ...

   Field id = new Field("ID", work.getId(),Field.Store.YES,Field.Index.NOT_ANALYZED);
   luceneDoc.add(id);

   Field name = new Field("NAME", work.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED);
   luceneDoc.add(name);

   ...

   return luceneDoc;

   }
   catch (LuceneException e) {
       ...
   }
}

I have noticed that the documents that are not returned have a low score. Since Luke behaves the same way as the application, I assume the problem arises when the index is created. What am I doing wrong?

Thanks in advance!


Solution

  • Lucene will resolve the search expression B:hello world to B:hello D:world, an expression of two terms. Here D is the default search field, probably "another Field" mentioned in your comment on @femtoRgon's answer.

    I'm guessing the results include documents 1 and 3 because they match on the token "world" in field D, while that token is absent from field D of document 5. But this is possible only if the default search operator is OR, not AND, because B:hello cannot match these documents.

    You may get the results you expect by using a phrase expression: B:"hello world". But you may not; WhitespaceAnalyzer will break this phrase into two tokens when it builds a Query object.

    You could get around the problem by using KeywordAnalyzer for field B, as described in my answer to another question.
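
To make the term mismatch concrete, here is a minimal plain-Java sketch. It is a toy model and does not use Lucene: the class NotAnalyzedDemo and its methods are hypothetical names invented for illustration. It assumes that a NOT_ANALYZED field is indexed as one single term (the whole value), and it simplifies matching to "the indexed term equals one of the query tokens". Under those assumptions, splitting the query on whitespace can never match "hello world" in field B, while keeping the query text as a single token can.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Lucene classes or methods.
class NotAnalyzedDemo {

    // Stand-in for whitespace tokenization of the query text.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Stand-in for keyword tokenization: the whole query text stays one token.
    static List<String> keywordTokens(String text) {
        return Collections.singletonList(text);
    }

    // Field B values of documents 1 to 5 from the question, keyed by field A.
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "hello world");
        docs.put("2", "hello world!");
        docs.put("3", "hello world");
        docs.put("4", "anything");
        docs.put("5", "hello world");
        return docs;
    }

    // A NOT_ANALYZED value is modeled as one indexed term (the whole string).
    // A document is a hit when that term equals one of the query tokens.
    static List<String> search(Map<String, String> docs, List<String> queryTokens) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            if (queryTokens.contains(doc.getValue())) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> docs = sampleDocs();
        System.out.println(search(docs, whitespaceTokens("hello world"))); // prints []
        System.out.println(search(docs, keywordTokens("hello world")));    // prints [1, 3, 5]
    }
}
```

In this model the whitespace-split query produces the tokens "hello" and "world", neither of which equals the single indexed term "hello world", so any hits in field B would have to come from somewhere else, such as the default field. Document 2 ("hello world!") is a different term and is correctly left out in both cases.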