This question refers to an earlier question... but unfortunately those clever people's solutions no longer seem to work with Lucene 6, because the signature of createComponents is now

TokenStreamComponents createComponents(final String fieldName)

i.e. the Reader is no longer supplied. Does anyone know what the present technique should be? Are we meant to make the Reader a field of the Analyzer class?
NB I don't actually want to filter anything: I want to get hold of the streams of tokens in order to create my own data structures (for frequency analysis and sequence-matching). So the idea is to use Lucene's Analyzer technology to produce different models of the corpus. A trivial example might be: one model where everything is lower-cased, another where casing is left as in the corpus.
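To make the "different models" idea concrete, here is a minimal plain-Java sketch (no Lucene dependency) of the kind of frequency model meant here. The whitespace split is a stand-in for a real Lucene tokenizer chain; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CorpusModels {

    // Count term frequencies over a naively tokenised text; the lowerCase
    // flag switches between the case-folded and case-preserving models.
    public static Map<String, Integer> frequencies(String text, boolean lowerCase) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            String term = lowerCase ? token.toLowerCase() : token;
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String corpus = "Humpty Dumpty sat on a wall Humpty Dumpty had a great fall";
        System.out.println(frequencies(corpus, true));  // lower-cased model
        System.out.println(frequencies(corpus, false)); // casing left as in the corpus
    }
}
```

With a real Analyzer the only change would be where the tokens come from; the model-building side stays the same.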
PS I also saw this question: but once again a Reader has to be supplied, so I'm assuming the context there was tokenising for the purpose of querying. When writing an index, although the Analyzers in early versions were clearly getting a Reader from somewhere when createComponents was called, at that point you don't yet have a Reader yourself (that I know of...).
Got it, again using the technique in the referenced question... which is essentially to "interfere" in some way with the battery of Filters applied during the crucial method of Analyzer: createComponents.

Thus, my doctored version of an EnglishAnalyzer:
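The shape of that "interference" can be seen without any Lucene at all: the filter chain is just the decorator pattern, and an observing filter can be spliced in anywhere to see every token pass by. A hypothetical Lucene-free sketch (all names here are made up, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FilterChainSketch {

    interface TokenSource {
        String next(); // null when exhausted, like incrementToken() returning false
    }

    // A fixed token source, standing in for the Tokenizer.
    static class ArrayTokens implements TokenSource {
        private final Iterator<String> it;
        ArrayTokens(String... tokens) { this.it = Arrays.asList(tokens).iterator(); }
        public String next() { return it.hasNext() ? it.next() : null; }
    }

    // A transforming filter: wraps another source (decorator pattern).
    static class LowerCaseFilter implements TokenSource {
        private final TokenSource in;
        LowerCaseFilter(TokenSource in) { this.in = in; }
        public String next() {
            String t = in.next();
            return t == null ? null : t.toLowerCase();
        }
    }

    // An observing filter: passes tokens through unchanged, but numbers
    // and records each one as a side effect.
    static class ExamineFilter implements TokenSource {
        private final TokenSource in;
        final List<String> seen = new ArrayList<>();
        ExamineFilter(TokenSource in) { this.in = in; }
        public String next() {
            String t = in.next();
            if (t != null) seen.add(String.format("# term %d |%s|", seen.size(), t));
            return t;
        }
    }

    public static void main(String[] args) {
        ExamineFilter chain = new ExamineFilter(
                new LowerCaseFilter(new ArrayTokens("Humpty", "Dumpty", "sat")));
        while (chain.next() != null) { } // drain the chain
        chain.seen.forEach(System.out::println);
    }
}
```

The real version does exactly this, except the chain elements are Lucene TokenFilters and the observer subclasses FilteringTokenFilter.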
private int nTerm = 0; // field added by me: per-document term counter

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // the standard EnglishAnalyzer chain...
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);

    // my modification starts here: a pass-through filter that lets me
    // examine each token as it goes by
    class ExamineFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public ExamineFilter(TokenStream in) {
            super(in);
        }

        @Override
        protected boolean accept() throws IOException {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            printOut(String.format("# term %d |%s|", nTerm, term)); // printOut is my own logging helper
            // ... do all sorts of things with this term ...
            nTerm++;
            return true; // accept every token: we are observing, not filtering
        }
    }

    class MyTokenStreamComponents extends TokenStreamComponents {
        MyTokenStreamComponents(Tokenizer source, TokenStream result) {
            super(source, result);
        }

        @Override
        public TokenStream getTokenStream() {
            // reset the term count at the start of each Document
            nTerm = 0;
            return super.getTokenStream();
        }
    }

    result = new ExamineFilter(result);
    return new MyTokenStreamComponents(source, result);
}
The results, with input

String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ...

are wonderful:
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |sat|
# term 3 |wall|
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |had|
# term 3 |great|
# term 4 |fall|
# term 0 |all|
# term 1 |king|
# term 2 |hors|
# term 3 |all|
# term 4 |king|
# term 5 |men|
...