Now I have several Lucene index sets (I call it shards), which indexes different document sets. They are independent, which means I can perform search on each of them without reading others. Then I get a query request. I want to search it over every index set and combine the result to form the final top documents.
I know that when scoring the documents, Lucene needs to know the <idf> of every term, and different index sets will give different <idf> to the same term (because different index sets hold different document sets). Thus to my understanding, I cannot compare the document score from different index sets directly. Then how should I generate the final result?
An obvious solution would be first merge the index and then perform the search over the big index. However, this is tooo time-consuming for me and thus unacceptable. Anyone has other better solutions?
P.S.: I don't want to use any packages or softwares (like Katta) except Lucene and Hadoop.
I think MultiReader is what you are looking for. If you have multiple IndexReaders, say reader1
and reader2
:
MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher searcher = new IndexSearcher(multiReader);