
Lucene Stopword and nGram


I'm using Lucene and I want to use n-grams together with stopwords.

I wrote my own Analyzer, modeled on Lucene's German stopword analyzer:

public class GermanNGramAnalyzer extends StopwordAnalyzerBase {

    @Override
    protected TokenStreamComponents createComponents(String s) {
        NGramTokenizer tokenizer = new NGramTokenizer(4, 4); // fixed-length 4-grams over the raw character stream
        TokenStream result = new StandardFilter(tokenizer);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, this.stopwords); // operates on whole tokens, i.e. on the 4-grams here
        result = new SetKeywordMarkerFilter(result, this.exclusionSet);
        result = new GermanNormalizationFilter(result);
        result = new NumberFilter(result); // custom filter, not part of Lucene
        return new TokenStreamComponents(tokenizer, result);
    }
(...)
}

This works, but not the way I want. Since these are 4-grams, the output looks like this (blanks shown as "_"):

Das Haus
das_
as_h
s_ha
_hau
haus

In German, "das" is like "the" and should be removed. But of course it isn't removed, because "das_", "as_h", and "s_ha" don't match the stopword "das" at all.

So I'd like to first run a word tokenizer, apply the stopword filter, and after that merge everything back together and build the n-grams as usual.

Of course I could "manually" remove all stopwords from the string before handing it to Lucene, but I'd expect this to be possible within Lucene itself.

Does anyone have an idea?


Solution

  • One possibility: instead of using NGramTokenizer as the tokenizer, first use StandardTokenizer (or any other suitable tokenizer), and then create the n-grams with an NGramTokenFilter, applied right after the StopFilter. That way the StopFilter sees whole words, so "das" is removed before any n-grams are built.
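A sketch of that chain, assuming roughly Lucene 7.x (package locations and the NGramTokenFilter constructor signature vary across Lucene versions; in 8.x the filter takes an extra preserveOriginal flag, and LowerCaseFilter/StopFilter moved out of the core package):

```java
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class GermanNGramAnalyzer extends StopwordAnalyzerBase {

    public GermanNGramAnalyzer() {
        super(GermanAnalyzer.getDefaultStopSet()); // German stopwords, incl. "das"
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // 1. Split the input into whole words first.
        Tokenizer source = new StandardTokenizer();
        // 2. Lowercase so stopword matching works ("Das" -> "das").
        TokenStream result = new LowerCaseFilter(source);
        // 3. Remove stopwords while they are still whole tokens.
        result = new StopFilter(result, this.stopwords);
        // 4. Only now expand each remaining token into 4-grams.
        //    (Lucene 8.x: new NGramTokenFilter(result, 4, 4, false))
        result = new NGramTokenFilter(result, 4, 4);
        return new TokenStreamComponents(source, result);
    }
}
```

One behavioral difference to be aware of: NGramTokenFilter builds n-grams per token, so "Das Haus" now yields just "haus", and grams spanning word boundaries (like "s_ha") no longer appear. Tokens shorter than the minimum gram size are also dropped unless the Lucene version in use offers the preserveOriginal option.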