
Lucene Java suggester for phrases without Solr


I have a large DB of binary documents (such as PDFs) and an index created without TermFreqVector, just "Store.NO, Index.ANALYZED". I'm trying to implement a phrase suggester/predictor on top of it. I would like to search for one or more words, like "where" or "where are", and I expect to get back something like "where are you john".

I'm surprised that LUKE is somehow able to restore documents term by term from the created index (I've checked its sources, but I still don't know how that's possible without TermFreqVector). Does anyone know how it works? I have two options for my suggester:

1) 'Somehow' use LUKE's mechanism to restore a document from the index I have now. (That would be the best.)

2) Create another index just for the phrase suggester. (However, the current indexing takes about 2-3 days and about 4-5 GB.) I've searched the net for a solution, but most results lead to Solr, which I can't use.

I've tried a few solutions already, but I'm stuck.

I would be grateful for any hints.


Solution

  • OK. After a few retries with a different approach, I got it working, and it's very fast. :) Here is what I did: I re-indexed all my documents with the additional option "TermVector.WITH_POSITIONS", and I search for terms directly in the index using PrefixQuery. Then I take all positions of the term I'm searching for within the documents and store them in a map. Finally, I iterate over the document terms, checking whether each term's position satisfies TermPosition <= (number of words in the suggested phrase).

    If you need examples, please ask :)
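
For anyone who wants to see the idea in code, here is a minimal sketch of the position-based lookup described above. It deliberately does not call Lucene: the `Map<String, List<Integer>>` returned by the hypothetical `termVector` helper only stands in for a per-document term vector stored with positions, and the class and method names (`PhraseSuggestSketch`, `suggest`, `matches`, `restMatches`) are made up for illustration. In a real index the matching terms would come from the prefix search and the positions from the stored term vector. Treating the last typed word as a prefix and the earlier words as exact matches is my assumption, not something stated in the answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class PhraseSuggestSketch {

    // Stand-in for a positional term vector of one document: term -> positions.
    // Whitespace tokenizing and lowercasing are assumptions for this sketch.
    static Map<String, List<Integer>> termVector(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    // Returns phrases of up to maxWords terms that start where the input matches.
    static List<String> suggest(List<Map<String, List<Integer>>> docs,
                                String input, int maxWords) {
        String[] words = input.trim().toLowerCase().split("\\s+");
        Set<String> phrases = new LinkedHashSet<>(); // dedupe, keep first-seen order
        for (Map<String, List<Integer>> vector : docs) {
            // Invert the term vector: position -> term.
            TreeMap<Integer, String> byPosition = new TreeMap<>();
            // Positions of every term that matches the first typed word.
            List<Integer> starts = new ArrayList<>();
            for (Map.Entry<String, List<Integer>> e : vector.entrySet()) {
                for (int pos : e.getValue()) {
                    byPosition.put(pos, e.getKey());
                }
                if (matches(e.getKey(), words, 0)) {
                    starts.addAll(e.getValue());
                }
            }
            for (int start : starts) {
                if (!restMatches(byPosition, start, words)) {
                    continue;
                }
                // Keep only the terms whose offset from the match is below maxWords.
                StringBuilder phrase = new StringBuilder();
                for (int i = 0; i < maxWords; i++) {
                    String term = byPosition.get(start + i);
                    if (term == null) {
                        break; // ran past the end of the document
                    }
                    if (i > 0) {
                        phrase.append(' ');
                    }
                    phrase.append(term);
                }
                phrases.add(phrase.toString());
            }
        }
        return new ArrayList<>(phrases);
    }

    // The last typed word is treated as a prefix, earlier words as exact terms.
    static boolean matches(String term, String[] words, int i) {
        return i == words.length - 1 ? term.startsWith(words[i]) : term.equals(words[i]);
    }

    // Checks that the typed words after the first also match at consecutive positions.
    static boolean restMatches(Map<Integer, String> byPosition, int start, String[] words) {
        for (int i = 1; i < words.length; i++) {
            String term = byPosition.get(start + i);
            if (term == null || !matches(term, words, i)) {
                return false;
            }
        }
        return true;
    }
}
```

With two documents built via `termVector("Where are you John")` and `termVector("where is the station")`, calling `suggest(docs, "where a", 4)` returns a single phrase, "where are you john", while `suggest(docs, "wh", 2)` returns "where are" and "where is". Inverting the whole term vector per document is the simple choice here; on a large index you would probably only want to do that for the documents the prefix search actually hits.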