I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.
Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.
Pra onde você for
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
    try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println("term: " + charTermAttribute.toString());
        }
        tokenStream.end();
    }
}
This gives me the following terms (collapsed to one line for readability):
pra onde você for
OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:
pra onde
Huh? Maybe it thinks that "você" ("you") and "for" ("may go") are stop words and has removed them.
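If that were the case, both words should show up in the default stop set. Here is a quick check using the static PortugueseAnalyzer.getDefaultStopSet() accessor (if I'm reading the API right):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;

// If these print true, the analyzer is simply removing "você" and "for" as stop words.
final CharArraySet ptStopWords = PortugueseAnalyzer.getDefaultStopSet();
System.out.println("você: " + ptStopWords.contains("você"));
System.out.println("for: " + ptStopWords.contains("for"));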
But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:
pra ond voc for
Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".
Other lines are as bad or worse:
StandardAnalyzer: a saudade é dor volta meu amor
PortugueseAnalyzer: saudad é dor volt amor
BrazilianAnalyzer: saudad é dor volt amor
Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output, as "volta" surely needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.
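For reference, the one-line outputs above are just the terms of each token stream joined with spaces; a small helper of my own (not anything from Lucene) along these lines produces them:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// My own helper: collapse the terms an analyzer produces for a line of text to a single string.
static String terms(final Analyzer analyzer, final String text) throws IOException {
    final StringBuilder result = new StringBuilder();
    try (TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        final CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(charTermAttribute.toString());
        }
        tokenStream.end();
    }
    return result.toString();
}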
Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.
Looking at the code for the PortugueseAnalyzer and the BrazilianAnalyzer, it looks like these analyzers are performing stemming. (I'm a little new to coding Lucene, so it's not something I expected.) So for indexing, maybe this is what the authors intended. Perhaps "você" is a stem for "você" and "vocês", and I guess "volt" is the stem of the verb "voltar" (the infinitive form). ("saudad" is not what I would expect for the stem of "saudade", but again, this aspect of text analysis is a bit new to me.)
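To convince myself it really is the stemming step, I tried running the tokens through just the PortugueseLightStemFilter, which is what the PortugueseAnalyzer source appears to use. This is only a sketch of that experiment, not the analyzer's exact chain:

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pt.PortugueseLightStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

StandardTokenizer tokenizer = new StandardTokenizer();
tokenizer.setReader(new StringReader("a saudade é dor volta meu amor"));
// Tokenize, lowercase, then apply only the Portuguese light stemmer (no stop word removal).
try (TokenStream tokenStream = new PortugueseLightStemFilter(new LowerCaseFilter(tokenizer))) {
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println("term: " + charTermAttribute.toString());
    }
    tokenStream.end();
}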
For my particular use case, I just want to tokenize the words and skip stop words. I can't find a way to turn off stemming for the PortugueseAnalyzer and the BrazilianAnalyzer, so I guess I'll just use a StandardAnalyzer but use the stop words from the language-specific analyzer, like this:
final Analyzer analyzer;
try (BrazilianAnalyzer ptBRAnalyzer = new BrazilianAnalyzer()) {
    // Reuse only the default Brazilian Portuguese stop word set; the analyzer itself is discarded.
    analyzer = new StandardAnalyzer(ptBRAnalyzer.getStopwordSet());
}
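A slightly more direct variant, assuming the static BrazilianAnalyzer.getDefaultStopSet() accessor returns the same default list, would avoid constructing a throwaway analyzer:

// Assumption: getDefaultStopSet() is the same stop word list the no-args constructor uses.
final Analyzer analyzer = new StandardAnalyzer(BrazilianAnalyzer.getDefaultStopSet());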
Either way it's a little roundabout, but at least it gives me more of what I was looking for:
StandardAnalyzer: a saudade é dor volta meu amor
StandardAnalyzer with PortugueseAnalyzer stop words: saudade é dor volta amor
StandardAnalyzer with BrazilianAnalyzer stop words: saudade é dor volta meu amor
That's better. But apparently the Portuguese analyzer treats "meu" as a stop word, even though the Brazilian analyzer does not. I would guess that the word for "my" means pretty much the same thing in European Portuguese and Brazilian Portuguese; it seems odd that the two analyzers would disagree on whether it should be a stop word by default.
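For what it's worth, a quick check of the two default lists (again assuming the static getDefaultStopSet() accessors) would show whether they really do disagree:

// Prints whether "meu" appears in each analyzer's default stop word set.
System.out.println("pt: " + PortugueseAnalyzer.getDefaultStopSet().contains("meu"));
System.out.println("pt-BR: " + BrazilianAnalyzer.getDefaultStopSet().contains("meu"));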