Lucene bigrams tokenizer to include punctuation signs


Is there any way to use Lucene's ShingleAnalyzerWrapper to generate bigrams that take punctuation marks (e.g. '.', ',', ';') into account? Quick example: the field "one two; three four" should yield only two bigrams: (one two) and (three four).


Solution

  • You could create a ShingleAnalyzerWrapper that wraps an analyzer based on LetterTokenizer. LetterTokenizer splits the input text at every non-letter character. Something like:

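    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LetterTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

    // Written against the older Analyzer API, where tokenStream() is overridable.
    public class MyCharAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // LetterTokenizer emits maximal runs of letters and drops everything else.
        return new LetterTokenizer(reader);
      }
    }

    // The single-argument constructor builds shingles of up to two tokens (bigrams).
    ShingleAnalyzerWrapper myBigramWrapper = new ShingleAnalyzerWrapper(new MyCharAnalyzer());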
    

    If you wanted better control over what you consider punctuation, you could subclass CharTokenizer and override the isTokenChar() method.
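
One thing to keep in mind: a tokenizer that simply drops punctuation leaves no trace of where the punctuation was, so a shingle filter on top of it may still pair "two" with "three". The sketch below illustrates the intended behaviour in plain Java, without any Lucene classes. The class name PunctuationAwareBigrams, its methods, and the choice of '.', ',' and ';' as boundary characters are all assumptions made for illustration; isTokenChar() here only mimics the role of the CharTokenizer method of the same name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, plain Java only: it does not use or mirror Lucene classes.
final class PunctuationAwareBigrams {

    // Plays the role of an isTokenChar()-style predicate: letters and digits
    // belong to a token, everything else separates tokens.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Assumption: these characters end a segment, so no bigram may span them.
    static boolean isBoundary(char c) {
        return c == '.' || c == ',' || c == ';';
    }

    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null; // last complete token in the current segment

        // Walk one position past the end so the final token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (isTokenChar(c)) {
                current.append(c);
                continue;
            }
            if (current.length() > 0) {
                String token = current.toString();
                if (previous != null) {
                    result.add(previous + " " + token);
                }
                previous = token;
                current.setLength(0);
            }
            if (isBoundary(c)) {
                previous = null; // start a new segment after punctuation
            }
        }
        return result;
    }
}
```

Under these assumptions, PunctuationAwareBigrams.bigrams("one two; three four") returns the two bigrams "one two" and "three four". In a Lucene setup, the same idea would have to be carried through the token stream itself, for example by keeping the punctuation as tokens or by recording a gap where it occurred.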