java · lucene · tokenize · analyzer

Lucene 6 - how to intercept tokenizing when writing an index?


This question says to look at this other question... but unfortunately those clever people's solutions no longer seem to work with Lucene 6, because the signature of createComponents is now:

TokenStreamComponents createComponents(final String fieldName)...

i.e. the Reader is no longer supplied.

Anyone know what the present technique should be? Are we meant to make the Reader a field of the Analyzer class?

NB I don't actually want to filter anything; I want to get hold of the streams of tokens in order to create my own data structure (for frequency analysis and sequence-matching). So the idea is to use Lucene's Analyzer technology to produce different models of the corpus. A trivial example might be: one model where everything is lower-cased, another where casing is left as in the corpus.
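
For concreteness, here is a minimal sketch of the two "models" I have in mind (the class and method names are invented purely for illustration, not taken from any existing code):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CorpusModels {

    // Model 1: everything lower-cased
    public static Analyzer lowerCased() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream sink = new LowerCaseFilter(source);
                return new TokenStreamComponents(source, sink);
            }
        };
    }

    // Model 2: casing left exactly as it is in the corpus
    public static Analyzer casePreserving() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                return new TokenStreamComponents(source, source);
            }
        };
    }
}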

PS I also saw this question, but once again a Reader has to be supplied, so I'm assuming the context there was tokenising for the purpose of querying. When writing an index, although the Analyzers in earlier versions were clearly getting a Reader from somewhere when createComponents was called, you don't yet have a Reader (as far as I know...).
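
A minimal sketch (assuming Lucene 6.x, with an arbitrary field name) of the kind of usage I mean there, where the text is already to hand and you call tokenStream yourself:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DumpTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new EnglishAnalyzer();
        // the text (or a Reader) is handed over here, not to createComponents
        try (TokenStream ts = analyzer.tokenStream("body", "Humpty Dumpty sat on a wall")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // obligatory before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
        analyzer.close();
    }
}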


Solution

  • Got it, again using the technique in the referenced question... which is essentially to "interfere" in some way with the battery of Filters applied inside Analyzer's crucial method, createComponents.

    Thus, my doctored version of an EnglishAnalyzer:

    private int nTerm = 0; // field added by me
    
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // the stock EnglishAnalyzer filter chain
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new EnglishPossessiveFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);
        if (!stemExclusionSet.isEmpty())
            result = new SetKeywordMarkerFilter(result, stemExclusionSet);
        result = new PorterStemFilter(result);
    
        // my modification starts here:
        class ExamineFilter extends FilteringTokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    
            public ExamineFilter(TokenStream in) {
                super(in);
            }
    
            @Override
            protected boolean accept() throws IOException {
                String term = new String(termAtt.buffer(), 0, termAtt.length());
                printOut(String.format("# term %d |%s|", nTerm, term)); // printOut is my own print helper
    
                // do all sorts of things with this term...
    
                nTerm++;
                return true; // accept every token: we only observe, we never filter anything out
            }
        }
        class MyTokenStreamComponents extends TokenStreamComponents {
            MyTokenStreamComponents(Tokenizer source, TokenStream result) {
                super(source, result);
            }
    
            @Override
            public TokenStream getTokenStream() {
                // reset term count at start of each Document
                nTerm = 0;
                return super.getTokenStream();
            }
        }
        result = new ExamineFilter(result);
        return new MyTokenStreamComponents(source, result);
        // my modification ends here
    }
    

    The results, with input:

        String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ... 
    

    are wonderful:

    # term 0 |humpti|
    # term 1 |dumpti|
    # term 2 |sat|
    # term 3 |wall|
    
    # term 0 |humpti|
    # term 1 |dumpti|
    # term 2 |had|
    # term 3 |great|
    # term 4 |fall|
    
    # term 0 |all|
    # term 1 |king|
    # term 2 |hors|
    # term 3 |all|
    # term 4 |king|
    # term 5 |men|
    

    ...
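
    For completeness, a hedged sketch of how this analyzer can be driven at index-writing time so that ExamineFilter sees every term. Here MyEnglishAnalyzer stands for the doctored class above (the class name, the field name and the in-memory RAMDirectory are mine, purely for illustration):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class BuildIndex {
        public static void main(String[] args) throws IOException {
            String[] contents = { "Humpty Dumpty sat on a wall,",
                                  "Humpty Dumpty had a great fall." };

            try (Directory dir = new RAMDirectory();
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new MyEnglishAnalyzer()))) {
                for (String line : contents) {
                    Document doc = new Document();
                    doc.add(new TextField("body", line, Field.Store.YES));
                    // tokenising happens here: accept() runs once per surviving term,
                    // and the overridden getTokenStream() resets nTerm for each document
                    writer.addDocument(doc);
                }
            }
        }
    }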