(More specific problem details are in the update below.) I have really long document field values. The tokens of these fields are of the form word|payload|position_increment (I need to control position increments and payloads manually). I collect these compound tokens for the entire document, join them with '\t', and then pass the resulting string to my custom analyzer. (For the really long field strings, something breaks inside UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
The analyzer is just the following:
class AmbiguousTokenAnalyzer extends Analyzer {

    private final PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split the field value into compound tokens on '\t'
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        // Strip the "|increment" suffix and apply it as the position increment
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        // Strip the "|payload" suffix and attach it as the payload
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        // Make sure all four attributes are present on the stream
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}
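For context, the analyzer is plugged into indexing in the usual way (a sketch; the RAMDirectory here is just a placeholder for the real index directory):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory(); // placeholder for the real directory
IndexWriterConfig config = new IndexWriterConfig(EngineInfo.ENGINE_VERSION, new AmbiguousTokenAnalyzer());
IndexWriter writer = new IndexWriter(directory, config);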
CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter both override incrementToken(), where the rightmost "|aaa" part of each token is stripped off and processed.
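Roughly, the position-increment filter does something like this (a simplified sketch, not the exact code; the payload filter is analogous):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class DelimitedPositionIncrementFilter extends TokenFilter {

    private final char delimiter;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    DelimitedPositionIncrementFilter(TokenStream input, char delimiter) {
        super(input);
        this.delimiter = delimiter;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Scan from the right: everything after the last delimiter is the increment
        char[] buf = termAtt.buffer();
        for (int i = termAtt.length() - 1; i >= 0; i--) {
            if (buf[i] == delimiter) {
                int inc = Integer.parseInt(new String(buf, i + 1, termAtt.length() - i - 1));
                posIncAtt.setPositionIncrement(inc);
                termAtt.setLength(i); // strip the "|inc" suffix from the term
                break;
            }
        }
        return true;
    }
}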
The field is configured as:
attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);
The problem is: if I pass the field to the analyzer as one huge string (via document.add(...)), it works fine, but if I pass it token by token, something breaks at the search stage. From what I have read, these two ways should produce the same index. Maybe my analyzer is missing something?
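To make the two variants concrete (a sketch; the literal token values are just examples taken from the output below):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();

// Variant 1: the whole field as one pre-joined string -- this works
doc.add(new Field("gramm", "S|3|1000\tV|1|1\tPR|1|1", attributeFieldType));

// Variant 2 (alternative): one Field instance per atomic token --
// this is what breaks at the search stage
// doc.add(new Field("gramm", "S|3|1000", attributeFieldType));
// doc.add(new Field("gramm", "V|1|1", attributeFieldType));
// doc.add(new Field("gramm", "PR|1|1", attributeFieldType));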
UPDATE
Here is my problem in more detail: in addition to indexing, I need the multi-valued field to be stored as-is. But if I pass it to the analyzer as multiple atomic tokens, only the first of them gets stored. What do I need to change in my custom analyzer so that all the atomic tokens end up stored, concatenated?
Well, it turns out that all the values are actually stored. Here is what I get after indexing:
indexSearcher.doc(0).getFields("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
And the "single-field" version
indexSearcher.doc(0).getField("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
I don't know why getField() returns only the first value (apparently that is simply its contract: it returns the first field with the given name), but for my needs getFields() seems to be fine.
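And if I ever need the single concatenated string back, I can rebuild it from the stored values (a sketch, using the same '\t' separator as at index time):

import org.apache.lucene.index.IndexableField;

StringBuilder joined = new StringBuilder();
for (IndexableField f : indexSearcher.doc(0).getFields("gramm")) {
    if (joined.length() > 0) {
        joined.append('\t');
    }
    joined.append(f.stringValue());
}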