Strange tokenization in Lucene 8 Brazilian Portuguese analyzers


I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.

Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.

Pra onde você for

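import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
  try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
    // The attribute instance is reused; its value changes on each incrementToken() call.
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // required before the first incrementToken()
    while (tokenStream.incrementToken()) {
      System.out.println("term: " + charTermAttribute.toString());
    }
    tokenStream.end();
  }
}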

This gives me the following terms (collapsed to one line for readability):

pra onde você for

OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:

pra onde

Huh? Maybe it treats "você" ("you") and "for" ("may go") as stop words and removes them.
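
To make that guess concrete, here is a minimal plain-Java sketch of what stop-word removal would do to this line. It does not use Lucene at all: the class name StopWordSketch, the method removeStopWords, and the two-word stop list are hypothetical, chosen only to reproduce the output above. I have not checked them against the analyzer's real stop list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
  // ASSUMPTION: a made-up stop list for illustration, not Lucene's Portuguese list.
  static final Set<String> ASSUMED_STOP_WORDS = Set.of("você", "for");

  // Lowercases the text, splits it on whitespace, and drops any token in stopWords.
  static List<String> removeStopWords(String text, Set<String> stopWords) {
    List<String> kept = new ArrayList<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty() && !stopWords.contains(token)) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // prints [pra, onde]
    System.out.println(removeStopWords("Pra onde você for", ASSUMED_STOP_WORDS));
  }
}
```

If the real stop list does contain both words, that alone would account for the two missing terms, with no other processing involved.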

But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:

pra ond voc for

Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".

Other lines are as bad or worse:

  • Text: "A saudade é dor, volta meu amor"
  • StandardAnalyzer: a saudade é dor volta meu amor
  • PortugueseAnalyzer: saudad é dor volt amor
  • BrazilianAnalyzer: saudad é dor volt amor

Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output: "volta" surely needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.

Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.


Solution

  • Looking at the code for the PortugueseAnalyzer and the BrazilianAnalyzer, it looks like these analyzers are performing stemming. (I'm a little new to coding Lucene, so it's not something I expected.) So for indexing, maybe this is what the authors intended. Perhaps "voc" is a stem for "você" and "vocês". And I guess "volt" is the stem of the verb "voltar" (infinitive form). ("saudad" is not what I would expect for the stem of "saudade", but again, this aspect of text analysis is a bit new to me.)

    For my particular use case, I just want to tokenize the words and skip stop words. I can't find a way to turn off stemming for the PortugueseAnalyzer and the BrazilianAnalyzer, so I guess I'll just use a StandardAnalyzer but use the stop words from the language-specific analyzer, like this:

    final Analyzer analyzer;
    try (BrazilianAnalyzer ptBRAnalyzer = new BrazilianAnalyzer()) {
      analyzer = new StandardAnalyzer(ptBRAnalyzer.getStopwordSet());
    }
    

    That's a little roundabout, but at least it gives me something closer to what I was looking for:

    • Text: "A saudade é dor, volta meu amor"
    • StandardAnalyzer: a saudade é dor volta meu amor
    • StandardAnalyzer with PortugueseAnalyzer stop words: saudade é dor volta amor
    • StandardAnalyzer with BrazilianAnalyzer stop words: saudade é dor volta meu amor

    That's better. But apparently the Portuguese analyzer thinks "meu" is a stop word, even though the Brazilian analyzer does not. I would guess that the word for "my" pretty much means the same in Portugal Portuguese and Brazilian Portuguese; it seems odd the two analyzers would disagree on whether it should be a stop word by default.
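
Going back to the stemming observation, the mangled terms are easier to accept once you see how crude a suffix-stripping rule can be and still be useful for matching. The following is a deliberately naive toy, assuming a single rule (drop one trailing vowel from tokens longer than three characters). It is not the algorithm Lucene's Portuguese or Brazilian stemmers use, and the names ToyStemSketch, toyStem and analyze are hypothetical. It only shows how a rule-based stemmer ends up with non-words such as "ond" or "saudad", and what switching stemming off gives back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ToyStemSketch {
  // Vowels the toy rule strips, including a few accented forms.
  static final String VOWELS = "aeiouáéíóúâêôãõ";

  // ASSUMPTION: toy rule, not Lucene's. Drops one trailing vowel from longer tokens.
  static String toyStem(String token) {
    int last = token.length() - 1;
    if (token.length() > 3 && VOWELS.indexOf(token.charAt(last)) >= 0) {
      return token.substring(0, last);
    }
    return token;
  }

  // Lowercases, splits on non-letters (so the comma disappears), optionally stems.
  static List<String> analyze(String text, boolean stem) {
    List<String> out = new ArrayList<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
      if (token.isEmpty()) {
        continue;
      }
      out.add(stem ? toyStem(token) : token);
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [pra, ond, voc, for]
    System.out.println(analyze("Pra onde você for", true));
    // prints [pra, onde, você, for]
    System.out.println(analyze("Pra onde você for", false));
  }
}
```

The point of a stem is that the query and the indexed text reduce to the same string, not that the string is a dictionary word. That is fine for search, but it is the wrong tool when the tokens themselves are the output, as in my case.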