I have documents with "word" and "stem" features. One word may have several stems, so I index the "stem" features by manipulating position increments. I do that as follows:
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setOmitNorms(true);
type.setTokenized(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectors(true);
String join_token = tok.nextToken(); // token is like "stem1 stem2 stem3"
TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41, new StringReader(join_token));
PositionIncrementAttribute attr = stream.addAttribute(PositionIncrementAttribute.class);
attr.setPositionIncrement(0);
stream.addAttribute(OffsetAttribute.class);
stream.addAttribute(CharTermAttribute.class);
feature = new Field(name,
join_token,
type);
feature.setTokenStream(stream);
doc.add(feature);
As you can see in the code, I initialize the Field with a fixed String value so that it is stored, and then pass a token stream to it (I found that solution somewhere here on Stack Overflow). I perform these exact steps for each join_token with stems. When I then look at the term vector of my words in Luke, the multiple stems of a single word appear at consecutive (different!) positions, while they should share a single position. What is going wrong?
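It looks like your problem is that you set the increment once, before the TokenStream is ever consumed, so attr.setPositionIncrement(0); is not applied to each token in the stream: the tokenizer clears its attributes for every token it emits, which resets the increment to the default of 1. If you wanted to do this manually, you would have to iterate over the tokens in the stream and call setPositionIncrement(0) for each one (for example in a small TokenFilter wrapped around the tokenizer).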
However, you may want to look into using the PositionFilter instead. It will handle the setting of the position increment to 0 for you as the stream is consumed.
This would look like the following:
String join_token = tok.nextToken(); // token is like "stem1 stem2 stem3"
TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41, new StringReader(join_token));
stream = new PositionFilter(stream, 0); // 0 also happens to be the default
feature.setTokenStream(stream); // attach the filtered stream to the Field as before
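To see what this changes in the index, here is a toy sketch in plain Java with no Lucene classes involved. The class PositionDemo and its positions helper are made up for illustration. It assumes positions are derived the way Lucene does it: start at -1 and add each token's position increment in turn. It also assumes the filter leaves the first token's increment alone and applies 0 to every following token.

```java
import java.util.Arrays;

public class PositionDemo {
    // Hypothetical helper (not a Lucene API): turns a sequence of position
    // increments into absolute positions by starting at -1 and adding each
    // token's increment in turn.
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Plain tokenizer: every stem has increment 1, so the stems
        // land on consecutive positions.
        System.out.println(Arrays.toString(positions(new int[] {1, 1, 1}))); // [0, 1, 2]

        // Assumed filter behaviour: first stem keeps increment 1, the rest
        // get 0, so all stems share one position.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 0}))); // [0, 0, 0]
    }
}
```

The first line of output matches what you are seeing in Luke now; the second is what you should see once the increment is applied per token.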