solr lucene information-retrieval inverted-index

Change the Indexing (Postings) Structure of Lucene

I am doing research on new ways to index documents. Specifically, I would like to change existing index structures to experiment with indexing techniques. For example, if Lucene has an inverted index that saves terms and doc IDs at indexing time, I would like to extend that structure to save other information, such as positions or statistics about the term. How would I go about making such extensions? Is there a better open source project than Lucene for doing such extensions? Thanks.

Solution

For example, if Lucene has an inverted index that saves terms and doc IDs at indexing time, I would like to extend that structure to save other information, such as positions or statistics about the term...

Postings entries in Lucene are quite generic. Lucene can already store arbitrary data as a byte array in a payload, which is attached to each occurrence (position) of a term in the postings.

One use of payloads is to store per-occurrence information such as term positions. (Lucene can also record positions itself when a field is indexed with positions enabled.) For example, if a term t occurs in document D1 at positions 1 and 3, and in D2 at positions 2 and 5, you could store these as separate entries in the postings for t, as shown below.

*t* => (D1,1) (D1,3) (D2, 2) (D2, 5)
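
To make the structure above concrete, here is a minimal plain-Java sketch of a positional postings list. It does not use Lucene at all; the class ToyPositionalIndex, the Posting type, and the whitespace tokenization are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> list of (docId, position) entries. */
class ToyPositionalIndex {
    /** One postings entry; a hypothetical type, not a Lucene class. */
    static final class Posting {
        final int docId;
        final int position;

        Posting(int docId, int position) {
            this.docId = docId;
            this.position = position;
        }

        @Override
        public String toString() {
            return "(D" + docId + "," + position + ")";
        }
    }

    private final Map<String, List<Posting>> postings = new TreeMap<>();

    /** Splits on whitespace, lowercases, and records each term with a 1-based position. */
    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].isEmpty()) {
                continue; // blank input produces a single empty token; skip it
            }
            postings.computeIfAbsent(tokens[i], k -> new ArrayList<>())
                    .add(new Posting(docId, i + 1));
        }
    }

    /** Returns the postings for a term, or an empty list if the term is unknown. */
    List<Posting> postingsFor(String term) {
        return postings.getOrDefault(term, Collections.<Posting>emptyList());
    }

    public static void main(String[] args) {
        ToyPositionalIndex index = new ToyPositionalIndex();
        index.addDocument(1, "t a t b");   // t occurs at positions 1 and 3
        index.addDocument(2, "a t b c t"); // t occurs at positions 2 and 5
        System.out.println(index.postingsFor("t")); // prints [(D1,1), (D1,3), (D2,2), (D2,5)]
    }
}
```

Each occurrence gets its own entry, which is the same granularity at which Lucene attaches payloads.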

The simplest way to do this is with the Lucene class DelimitedPayloadTokenFilter. In the text you feed to the analyzer, append the value you want to store to each term, separated by a delimiter character such as '|' (for example, a token like term|1). The filter splits each token at the delimiter, keeps the term, and passes the rest to the encoder to become the payload, as shown in the following example.

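import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

// Uses the older Analyzer API; package names and signatures differ in later Lucene versions.
class PayloadAnalyzer extends Analyzer {
  // Turns the text after the delimiter into payload bytes.
  private final PayloadEncoder encoder;

  PayloadAnalyzer(PayloadEncoder encoder) {
    this.encoder = encoder;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    // Splits "term|value" tokens: the term is indexed, the value becomes the payload.
    result = new DelimitedPayloadTokenFilter(result, '|', encoder);
    return result;
  }
}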
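
If it helps to see what the delimiter handling amounts to, here is a minimal plain-Java sketch that imitates the idea without depending on Lucene. The class DelimitedPayloadSketch, the TermWithPayload holder, and the choice to split at the first delimiter are assumptions for illustration, not Lucene's actual implementation; the float is written as 4 big-endian bytes.

```java
import java.nio.ByteBuffer;

/** Plain-Java imitation of splitting "term|value" tokens and encoding the value. */
class DelimitedPayloadSketch {
    /** Hypothetical holder for a term and its payload bytes (null if there is none). */
    static final class TermWithPayload {
        final String term;
        final byte[] payload;

        TermWithPayload(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Encodes a float as 4 bytes in big-endian order (ByteBuffer's default). */
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Splits a token at the first delimiter; tokens without one get no payload. */
    static TermWithPayload parse(String token, char delimiter) {
        int idx = token.indexOf(delimiter);
        if (idx < 0) {
            return new TermWithPayload(token.toLowerCase(), null);
        }
        String term = token.substring(0, idx).toLowerCase();
        float value = Float.parseFloat(token.substring(idx + 1));
        return new TermWithPayload(term, encodeFloat(value));
    }

    public static void main(String[] args) {
        for (String token : "Quick|2.5 brown fox|1.0".split("\\s+")) {
            TermWithPayload t = parse(token, '|');
            int length = (t.payload == null) ? 0 : t.payload.length;
            System.out.println(t.term + " -> " + length + " payload bytes");
        }
    }
}
```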

To decode the values stored in the payloads, you can use something like the following.

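import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

class PayloadSimilarity extends DefaultSimilarity {
  // The parameters of scorePayload vary between Lucene versions; check the one you use.
  @Override
  public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
    // Interprets the payload bytes starting at offset as a float.
    return PayloadHelper.decodeFloat(bytes, offset);
  }
}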

You can then use the PayloadTermQuery class so that these payload values contribute to the ranking of documents.
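
As a rough picture of how decoded payloads can feed into a score, here is a plain-Java sketch that averages the payload values over a term's occurrences in one document and multiplies the result into a base score. The class PayloadScoringSketch and the averaging rule are assumptions for illustration; this is not Lucene's exact scoring formula.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

/** Folds per-occurrence float payloads into a single score factor. */
class PayloadScoringSketch {
    /** Reads 4 big-endian bytes starting at offset as a float. */
    static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    /** Averages the payload values; returns a neutral 1.0 when there are none. */
    static float averagePayload(List<byte[]> payloads) {
        if (payloads.isEmpty()) {
            return 1.0f;
        }
        float sum = 0f;
        for (byte[] payload : payloads) {
            sum += decodeFloat(payload, 0);
        }
        return sum / payloads.size();
    }

    /** Hypothetical combination rule: base score times the average payload value. */
    static float score(float baseScore, List<byte[]> payloads) {
        return baseScore * averagePayload(payloads);
    }

    public static void main(String[] args) {
        List<byte[]> payloads = Arrays.asList(
                ByteBuffer.allocate(4).putFloat(1.0f).array(),
                ByteBuffer.allocate(4).putFloat(3.0f).array());
        System.out.println(score(0.5f, payloads)); // prints 1.0
    }
}
```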

As an exercise, you could store other term-specific information in the payload, such as (i) part-of-speech (POS) tags or (ii) word vectors of terms, and use a combination of these features during ranking.
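
One possible starting point for that exercise is sketched below in plain Java: a payload that packs a POS tag and a float weight into 5 bytes. The PosTag enum, the byte layout, and the rule that nouns count double are all assumptions made up for illustration.

```java
import java.nio.ByteBuffer;

/** Packs two per-term features into one payload byte array. */
class FeaturePayloadSketch {
    /** Hypothetical, deliberately tiny tag set. */
    enum PosTag { NOUN, VERB, ADJECTIVE, OTHER }

    /** Layout assumed here: 1 byte for the tag ordinal, then a 4-byte big-endian float. */
    static byte[] encode(PosTag tag, float weight) {
        return ByteBuffer.allocate(5)
                .put((byte) tag.ordinal())
                .putFloat(weight)
                .array();
    }

    static PosTag decodeTag(byte[] payload) {
        return PosTag.values()[payload[0]];
    }

    static float decodeWeight(byte[] payload) {
        return ByteBuffer.wrap(payload, 1, 4).getFloat();
    }

    /** Hypothetical ranking factor: nouns count double, other tags use the stored weight. */
    static float factor(byte[] payload) {
        float weight = decodeWeight(payload);
        return decodeTag(payload) == PosTag.NOUN ? 2.0f * weight : weight;
    }

    public static void main(String[] args) {
        byte[] payload = encode(PosTag.NOUN, 1.5f);
        System.out.println(decodeTag(payload) + " " + factor(payload)); // prints NOUN 3.0
    }
}
```

A word vector could be stored the same way by writing its components one after another, at the cost of a larger index.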