Tags: java, lucene, indexing, search-engine

Lucene multi-value field indexing


(More specific problem details are in the update below.) I have very long document field values. The tokens of these fields are of the form word|payload|position_increment (I need to control position increments and payloads manually). I collect these compound tokens for the entire document, join them with '\t', and then pass the resulting string to my custom analyzer. (For very long field strings, something breaks inside UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
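For reference, the compound-token string described above can be assembled with something like the following. This is an illustrative sketch (class and method names are mine, not from the original code):

```java
import java.util.List;
import java.util.StringJoiner;

public class CompoundTokenBuilder {
    /**
     * Joins word|payload|position_increment triples into the single
     * tab-separated string that is fed to the custom analyzer.
     */
    static String buildFieldValue(List<String[]> tokens) {
        StringJoiner joiner = new StringJoiner("\t");
        for (String[] t : tokens) {
            // t[0] = word, t[1] = payload, t[2] = position increment
            joiner.add(t[0] + "|" + t[1] + "|" + t[2]);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String value = buildFieldValue(List.of(
                new String[]{"S", "3", "1000"},
                new String[]{"V", "1", "1"}));
        System.out.println(value);  // S|3|1000	V|1|1 (tab-separated)
    }
}
```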

The analyzer is just the following:

class AmbiguousTokenAnalyzer extends Analyzer {
    private final PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split the field value into compound tokens on '\t'
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        // Peel off the position-increment suffix, then the payload suffix
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}

CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each override incrementToken(), where the rightmost "|aaa" part of a token is parsed and stripped.
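The core parsing step of both filters can be illustrated outside of Lucene as splitting a token at its rightmost delimiter. This is a hedged sketch of that logic only, not the actual filter code:

```java
public class DelimitedSuffix {
    /**
     * Splits "word|payload|increment" at its rightmost delimiter:
     * returns { remainder, suffix }. A filter would consume the suffix
     * (payload or position increment) and pass the remainder downstream.
     */
    static String[] stripRightmost(String token, char delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (idx < 0) {
            return new String[]{token, null};  // no suffix present
        }
        return new String[]{token.substring(0, idx), token.substring(idx + 1)};
    }

    public static void main(String[] args) {
        String[] step1 = stripRightmost("S|3|1000", '|');  // ["S|3", "1000"]
        String[] step2 = stripRightmost(step1[0], '|');    // ["S", "3"]
        System.out.println(step2[0] + " payload=" + step2[1] + " posInc=" + step1[1]);
    }
}
```

Applied twice, as in the filter chain above, this yields the term "S", payload "3", and position increment "1000".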

The field is configured as:

attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);

The problem is that if I pass the field value to the analyzer as a whole (one huge string, via document.add(...)), it works fine, but if I add it token by token, something breaks at the search stage. As I have read somewhere, these two approaches should produce the same index. Maybe my analyzer is missing something?

UPDATE

Here is my problem in more detail: in addition to being indexed, the multi-value field needs to be stored as-is. And if I pass it to the analyzer as multiple atomic tokens, only the first of them appears to be stored. What do I need to change in my custom analyzer so that all the atomic tokens end up stored?


Solution

  • Well, it turns out that all the values are actually stored. Here is what I get after indexing:

    indexSearcher.doc(0).getFields("gramm")
    
    stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
    stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
    stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
    stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
    stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
    stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
    stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
    stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
    stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
    

    And the "single-field" version

    indexSearcher.doc(0).getField("gramm")
    
    stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
    

    I don't know why getField() returns only the first value, but it seems that for my needs getFields() is OK. (This matches the documented contract: Document.getField() returns the first field with the given name, while getFields() returns all of them.)
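A minimal stand-in for that contract (plain Java, no Lucene, purely to illustrate the first-vs-all semantics seen above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MultiValueDemo {
    /** getField-like lookup: first entry with the name, or null. */
    static String getFirst(List<Map.Entry<String, String>> fields, String name) {
        for (Map.Entry<String, String> f : fields) {
            if (f.getKey().equals(name)) return f.getValue();
        }
        return null;
    }

    /** getFields-like lookup: every entry with the name, in insertion order. */
    static List<String> getAll(List<Map.Entry<String, String>> fields, String name) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> f : fields) {
            if (f.getKey().equals(name)) out.add(f.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // A document stores one field instance per value added
        List<Map.Entry<String, String>> doc = List.of(
                Map.entry("gramm", "S|3|1000"),
                Map.entry("gramm", "V|1|1"),
                Map.entry("gramm", "PR|1|1"));
        System.out.println(getFirst(doc, "gramm"));  // S|3|1000
        System.out.println(getAll(doc, "gramm"));    // [S|3|1000, V|1|1, PR|1|1]
    }
}
```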