Search code examples
javaattributeslucenetokentokenize

How to get a Token from a Lucene TokenStream?


I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?


Solution

  • Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
    
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }
    

    Edit: The new way

    According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }