java, indexing, lucene, stemming, stop-words

How to combine Analyzer instances for stop word removal and stemming in Lucene (5.2.1)?


I'm using the latest version of Lucene, 5.2.1. While indexing documents, I want stop words to be removed first, and then all remaining words to be stemmed to their root form.

EnglishAnalyzer is available, but its stemming is not accurate. There is also StopAnalyzer, which removes stop words.

Does Lucene have an analyzer that does both?

I have also written a custom analyzer for stemming using KStemFilter. How can I use the existing StopAnalyzer in that custom analyzer?


Solution

  • Yes, it's possible. In Lucene you get this by chaining a tokenizer with token filters rather than by nesting Analyzer instances.

    You should use something like this:

    // Tokenize the input text on whitespace.
    StringReader reader = new StringReader(text);
    Tokenizer whitespaceTokenizer = new WhitespaceTokenizer();
    whitespaceTokenizer.setReader(reader);
    // Remove English stop words first, then stem the remaining tokens.
    TokenStream tokenStream = new StopFilter(whitespaceTokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    tokenStream = new PorterStemFilter(tokenStream);
    

    where text is a variable containing the text you want to analyze. Here I start with whitespace tokenization (you could replace it with StandardTokenizer, which is more sophisticated), then remove stop words with StopFilter, and finally apply PorterStemFilter later in the chain (I find it better than the plain EnglishStemmer; you could also replace it with any other TokenFilter you like, such as KStemFilter). Note that this chain does not lowercase tokens, so a capitalized stop word such as "The" is not removed unless you add a LowerCaseFilter before the StopFilter.

    A complete example is available here: https://raw.githubusercontent.com/MysterionRise/information-retrieval-adventure/master/lucene4/src/main/java/org/mystic/StopWordsStemmingAnalyzer.java
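
To use the same chain at indexing time, it can be wrapped in a custom Analyzer. The following is a minimal sketch, assuming the Lucene 5.x API (in later major versions some of these classes live in different packages). The class name StopAndStemAnalyzer, the field name "body", and the analyze helper are made up for illustration. It swaps in StandardTokenizer, LowerCaseFilter and the KStemFilter mentioned in the question, so it is not the linked example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical analyzer name; the imports assume the Lucene 5.x package layout.
final class StopAndStemAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize, lowercase, drop English stop words, then stem with KStem.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new KStemFilter(result);
        return new TokenStreamComponents(source, result);
    }

    // Illustrative helper: runs the analyzer over a string and collects the terms.
    static List<String> analyze(String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (Analyzer analyzer = new StopAndStemAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // required before incrementToken()
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        }
        return terms;
    }
}
```

For example, StopAndStemAnalyzer.analyze("The dogs are running") should drop "the" and "are" and return only the stemmed forms of the remaining two words. An instance of the analyzer can then be passed to IndexWriterConfig so that the same chain is applied while indexing.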